Abstract. Visual object recognition requires the matching of an image with a 
set of models stored in memory. In this paper we propose an approach to recognition 
in which a 3-D object is represented by the linear combination of 2-D images of the 
object. If $M = \{M_1, \ldots, M_k\}$ is the set of pictures representing a given object, and
P is the 2-D image of an object to be recognized, then P is considered an instance
of M if $P = \sum_{i=1}^{k} a_i M_i$ for some constants $a_i$. We show that this approach
handles correctly rigid 3-D transformations of objects with sharp as well as smooth 
boundaries, and can also handle non-rigid transformations. The paper is divided 
into two parts. In the first part we show that the variety of views depicting the 
same object under different transformations can often be expressed as the linear 
combinations of a small number of views. In the second part we suggest how this 
linear combination property may be used in the recognition process. 
Recognition by Linear Combinations of Models



1 Modeling Objects by the Linear Combination of 
Images 

1.1 Recognition by Alignment 

Visual object recognition requires the matching of an image with a set of models stored
in memory. Let $M = \{M_1, \ldots, M_n\}$ be the set of stored models, and P be the image
to be recognized. In general, the viewed object, depicted by P, may differ from all the
previously seen images of the same object. It may be, for instance, the image of a
three-dimensional object seen from a novel viewing position. To compensate for these
variations, we may allow the models (or the viewed object) to undergo certain compensating
transformations during the matching stage. If $\mathcal{T}$ is the set of allowable
transformations, the matching stage requires the selection of a model $M_i \in M$ and a
transformation $T \in \mathcal{T}$, such that the viewed object P and the transformed model
$TM_i$ will be as close as possible. The general scheme is called the alignment approach,
since an alignment transformation is applied to the model (or to the viewed object) prior
to, or during, the matching stage. Such an approach is used in [Chien & Aggarwal 1987,
Faugeras & Hebert 1986, Fischler & Bolles 1981, Huttenlocher & Ullman 1987, Lowe 1985,
Thompson & Mundy 1987, Ullman 1986]. Key problems that arise in any alignment scheme are
how to represent the set of different models M, what is the set of allowable transformations
$\mathcal{T}$, and, for a given model $M_i \in M$, how to determine the transformation
$T \in \mathcal{T}$ so as to minimize the difference between P and $TM_i$. For example, in
the scheme proposed by Basri and Ullman [1988] a model is represented by a set of 2-D
contours, with associated depth and curvature values at each contour point. The set of
allowed transformations includes 3-D rotation, translation and scaling, followed by an
orthographic projection. The transformation is determined as in [Huttenlocher & Ullman 1987,
Ullman 1986, 1989] by identifying at least three corresponding features (points or lines)
in the image and the object.

In this paper we suggest a different approach, in which each model is represented 
by the linear combination of 2-D images of the object. The new approach has several 
advantages. First, it handles all the rigid 3-D transformations, but it is not restricted 
to such transformations. Second, there is no need in this scheme to explicitly recover 
and represent the 3-D structure of objects. Third, the computations involved are often 
simpler than in previous schemes. 

The paper is divided into two parts. In the first (section 1) we show that the variety 
of views depicting the same object under different transformations can often be expressed 
as the linear combinations of a small number of views. In the second part (section 2) we 
suggest how this linear combination property may be used in the recognition process. 

1.2 Using Linear Combinations of Images to Model Objects 
and Their Transformations 

The modeling of objects using linear combinations of images is based on the following 
observation. For many continuous transformations of interest in recognition, such as 
3-D rotation, translation and scaling, all the possible views of the transforming object 
can be expressed simply as the linear combination of other views of the same object. 
The coefficients of these linear combinations often satisfy, in addition, certain functional
restrictions. In the next two sections we show that the set of possible images of an object
undergoing rigid 3-D transformations and scaling is embedded in a linear space, spanned 
by a small number of 2-D images. 

The images we will consider are 2-D edge maps produced in the image by the (ortho- 
graphic) projection of the bounding contours and other visible contours on 3-D objects. 
We will make use of the following definitions. Given an object and a viewing direction, 
the rim is the set of all the points on the object's surface, whose normal is perpendicular 
to the viewing direction [Koenderink & Van Doorn 1979]. This set is also called the
contour generator [Marr 1977]. A silhouette is an image generated by the orthographic 
projection of the rim. In the analysis below we assume that every point along the silhou- 
ette is generated by a single rim point. An edge map of an object usually contains the 
silhouette, which is generated by its rim. 

We will examine two cases below: objects with sharp edges, and objects with smooth
boundary contours. The difference between these two cases is illustrated in Figure 1.
For an object with sharp edges, such as the cube in Fig. 1 (a & b), the rim is stable on
the object as long as the edge is visible. In contrast, a rim that is generated by smooth
bounding surfaces, such as in the ellipsoid in Fig. 1 (c & d), is not fixed on the object,
but changes continuously with the viewpoint.

1.3 Objects with Sharp Edges 

In the discussion below we examine the case of objects with sharp edges undergoing 
different transformations followed by an orthographic projection. In each case we show 
how the image of an object obtained by the transformation in question can be expressed as 
the linear combination of a small number of pictures. The coefficients of this combination 
may be different for the x- and y-coordinates. That is, the intermediate view of the object
may be given by two linear combinations, one for the x-coordinates and the other for the
y-coordinates. In addition, certain functional restrictions may hold among the different
coefficients. 

To introduce the scheme we first apply it to the restricted case of rotation about the 
vertical axis, then examine more general transformations. 

1.3.1 3-D Rotation Around the Vertical Axis 

Figure 1: Changes in the rim during rotation. (a) A bird's eye view of a cube. (b) The cube
after rotation. In both (a) and (b) points p, q lie on the rim. (c) A bird's eye view of an
ellipsoid. (d) The ellipsoid after rotation. The rim points p, q in (c) are replaced by p', q'
in (d). (e) An ellipsoid in a frontal view. (f) The rotated ellipsoid (outer), superimposed on
the appearance of the rim, as a planar space curve after rotation by the same amount (inner).
(From [Basri & Ullman, 1988].)

Let $P_1$ and $P_2$ be two images of an object $O$ rotating in depth around the vertical axis
(Y-axis). $P_2$ is obtained from $P_1$ following a rotation by an angle $\alpha$
($\alpha \neq k\pi$). Let $\hat{P}$ be a third image of the same object obtained from $P_1$
by a rotation of an angle $\theta$ around the vertical axis. The projections of a point
$\mathbf{p} = (x, y, z) \in O$ in the three images are given by:

$$\mathbf{p}_1 = (x_1, y_1) = (x, y) \in P_1$$
$$\mathbf{p}_2 = (x_2, y_2) = (x\cos\alpha + z\sin\alpha,\; y) \in P_2$$
$$\hat{\mathbf{p}} = (\hat{x}, \hat{y}) = (x\cos\theta + z\sin\theta,\; y) \in \hat{P}$$

Claim: Two scalars a and b exist, such that for every point $\mathbf{p} \in O$:

$$\hat{x} = a x_1 + b x_2$$

with:

$$a^2 + b^2 + 2ab\cos\alpha = 1$$

Proof: The scalars a and b are given explicitly by:

$$a = \frac{\sin(\alpha - \theta)}{\sin\alpha}, \qquad b = \frac{\sin\theta}{\sin\alpha}$$

Then:

$$a x_1 + b x_2 = \frac{\sin(\alpha - \theta)}{\sin\alpha}\, x + \frac{\sin\theta}{\sin\alpha}\,(x\cos\alpha + z\sin\alpha) = x\cos\theta + z\sin\theta = \hat{x}$$

Therefore, an image of an object rotating around the vertical axis is always a linear
combination of two model images. It is straightforward to verify that the coefficients a
and b satisfy the above constraint. It is worth noting that the new view $\hat{P}$ is not
restricted to be an intermediate view (that is, the rotation angle $\theta$ may be larger
than $\alpha$). Finally, it should be noted that we do not deal at this stage with occlusion;
we assume here that the same set of points is visible in the different views.
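
The claim can be checked numerically. The following Python sketch is only an illustration: the random point set, the chosen angles and the helper name `project_x` are assumptions made for this example and are not part of the original scheme.

```python
import numpy as np

def project_x(points, angle):
    """x-coordinates after rotating 3-D points about the Y-axis by `angle`
    and projecting orthographically: x' = x cos(angle) + z sin(angle)."""
    return points[:, 0] * np.cos(angle) + points[:, 2] * np.sin(angle)

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 3))          # hypothetical object points (x, y, z)
alpha = np.radians(30.0)                   # rotation between the two model views
theta = np.radians(75.0)                   # new view; larger than alpha on purpose

x1 = project_x(points, 0.0)                # first model view
x2 = project_x(points, alpha)              # second model view
x_new = project_x(points, theta)           # view to be explained

# Coefficients as given in the proof.
a = np.sin(alpha - theta) / np.sin(alpha)
b = np.sin(theta) / np.sin(alpha)

combination_error = np.max(np.abs(a * x1 + b * x2 - x_new))
constraint_value = a**2 + b**2 + 2 * a * b * np.cos(alpha)
```

Under these assumptions `combination_error` should vanish up to rounding and `constraint_value` should equal 1, even though the new view lies outside the range spanned by the two model views.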

1.3.2 Linear Transformations in 3-D Space 

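Let $O$ be a set of object points. Let $P_1$, $P_2$ and $P_3$ be three images of $O$,
obtained by applying $3 \times 3$ matrices R, S and T to $O$, respectively. (In particular,
R can be the identity matrix, and S, T two rotations producing the second and third views.)
Let $\hat{P}$ be a fourth image of the same object obtained by applying a different
$3 \times 3$ matrix U to $O$. Let $\mathbf{r}_1$, $\mathbf{s}_1$, $\mathbf{t}_1$ and
$\mathbf{u}_1$ be the first row vectors of R, S, T and U, respectively, and let
$\mathbf{r}_2$, $\mathbf{s}_2$, $\mathbf{t}_2$ and $\mathbf{u}_2$ be the second row vectors
of R, S, T and U respectively. The positions of a point $\mathbf{p} \in O$ in the four images
are given by:

$$\mathbf{p}_1 = (x_1, y_1) = (\mathbf{r}_1\mathbf{p},\; \mathbf{r}_2\mathbf{p})$$
$$\mathbf{p}_2 = (x_2, y_2) = (\mathbf{s}_1\mathbf{p},\; \mathbf{s}_2\mathbf{p})$$
$$\mathbf{p}_3 = (x_3, y_3) = (\mathbf{t}_1\mathbf{p},\; \mathbf{t}_2\mathbf{p})$$
$$\hat{\mathbf{p}} = (\hat{x}, \hat{y}) = (\mathbf{u}_1\mathbf{p},\; \mathbf{u}_2\mathbf{p})$$

Claim: If both sets $\{\mathbf{r}_1, \mathbf{s}_1, \mathbf{t}_1\}$ and
$\{\mathbf{r}_2, \mathbf{s}_2, \mathbf{t}_2\}$ are linearly independent, then there
exist scalars $a_1, a_2, a_3$ and $b_1, b_2, b_3$ such that for every point
$\mathbf{p} \in O$ it holds that:

$$\hat{x} = a_1 x_1 + a_2 x_2 + a_3 x_3$$
$$\hat{y} = b_1 y_1 + b_2 y_2 + b_3 y_3$$

Proof: $\{\mathbf{r}_1, \mathbf{s}_1, \mathbf{t}_1\}$ are linearly independent. Therefore,
they span $\mathbb{R}^3$, and there exist scalars $a_1$, $a_2$ and $a_3$ such that:

$$\mathbf{u}_1 = a_1\mathbf{r}_1 + a_2\mathbf{s}_1 + a_3\mathbf{t}_1$$

Since:

$$\hat{x} = \mathbf{u}_1\mathbf{p}$$

It follows that:

$$\hat{x} = a_1\mathbf{r}_1\mathbf{p} + a_2\mathbf{s}_1\mathbf{p} + a_3\mathbf{t}_1\mathbf{p}$$

Therefore:

$$\hat{x} = a_1 x_1 + a_2 x_2 + a_3 x_3$$

In a similar way we obtain that:

$$\hat{y} = b_1 y_1 + b_2 y_2 + b_3 y_3$$

Therefore, an image of an object undergoing a linear transformation in 3-D space is
a linear combination of three model images.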
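
The argument of the proof can be traced in a few lines of Python. This is a minimal sketch under stated assumptions: the matrices are random stand-ins for arbitrary linear transformations, and the names `points` and `image` are hypothetical helpers introduced for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(3, 15))        # columns are hypothetical 3-D points p
# Four arbitrary linear maps standing in for R, S, T (models) and U (new view).
R, S, T, U = (rng.normal(size=(3, 3)) for _ in range(4))

def image(M):
    """Orthographic image under the linear map M: rows 0 and 1 of M @ points."""
    return (M @ points)[:2]

(x1, y1), (x2, y2), (x3, y3) = image(R), image(S), image(T)
x_new, y_new = image(U)

# Solve u1 = a1 r1 + a2 s1 + a3 t1, and likewise for the second rows.
a = np.linalg.solve(np.stack([R[0], S[0], T[0]], axis=1), U[0])
b = np.linalg.solve(np.stack([R[1], S[1], T[1]], axis=1), U[1])

# The same coefficients must reproduce the coordinates of every point.
x_error = np.max(np.abs(a[0] * x1 + a[1] * x2 + a[2] * x3 - x_new))
y_error = np.max(np.abs(b[0] * y1 + b[1] * y2 + b[2] * y3 - y_new))
```

Note that the coefficients are obtained from the row vectors alone, so they are independent of the particular points, as the claim requires.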

1.3.3 General Rotation in 3-D Space 

Rotation is a nonlinear subgroup of the linear transformations. Therefore, an image of
a rotating object is still a linear combination of three model images. However, not every
point in this linear space represents a pure rotation of the object. Indeed, we can show
that only points that satisfy the following three constraints represent images of a rotating
object.

Claim: The coefficients of an image of a rotating object must satisfy the three following
constraints:

$$\| a_1\mathbf{r}_1 + a_2\mathbf{s}_1 + a_3\mathbf{t}_1 \| = 1$$
$$\| b_1\mathbf{r}_2 + b_2\mathbf{s}_2 + b_3\mathbf{t}_2 \| = 1$$
$$(a_1\mathbf{r}_1 + a_2\mathbf{s}_1 + a_3\mathbf{t}_1) \cdot (b_1\mathbf{r}_2 + b_2\mathbf{s}_2 + b_3\mathbf{t}_2) = 0$$

Proof: U is a rotation matrix. Therefore:

$$\|\mathbf{u}_1\| = 1$$
$$\|\mathbf{u}_2\| = 1$$
$$\mathbf{u}_1 \cdot \mathbf{u}_2 = 0$$

And the required terms are obtained directly by substituting $\mathbf{u}_1$ and
$\mathbf{u}_2$ with the appropriate linear combinations. It also follows immediately that
if the constraints are met then the new view represents a possible rotation of the object.

These functional constraints are second degree polynomials in the coefficients and
therefore span a nonlinear manifold within the linear subspace. In order to check whether
a specific set of coefficients represents a rigid rotation, the values of the matrices R, S and
T are required. These can be retrieved by applying methods of "structure from motion"
to the model views. Ullman [1979] showed that in case of rigid transformations four
corresponding points in three views are sufficient. A linear algorithm that can be used
to recover the rotation matrices has been suggested by Huang & Lee [1989]. (The same
method can be extended to deal with scale changes, in addition to the rotation.)
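
A minimal Python sketch of the rigidity test follows, assuming that the matrices R, S and T are already known. The helper names `rot` and `rotation_constraints`, and the particular angles, are hypothetical choices made for the illustration.

```python
import numpy as np

def rot(axis, deg):
    """Rotation matrix about a coordinate axis ('x', 'y' or 'z') by `deg` degrees."""
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    i, j = {"x": (1, 2), "y": (2, 0), "z": (0, 1)}[axis]
    M = np.eye(3)
    M[i, i], M[i, j], M[j, i], M[j, j] = c, -s, s, c
    return M

def rotation_constraints(a, b, R, S, T):
    """Return the three quantities that must equal (1, 1, 0) for a pure rotation."""
    u1 = a[0] * R[0] + a[1] * S[0] + a[2] * T[0]
    u2 = b[0] * R[1] + b[1] * S[1] + b[2] * T[1]
    return np.linalg.norm(u1), np.linalg.norm(u2), float(u1 @ u2)

# Assumed model transformations and a genuinely rigid new view U.
R = np.eye(3)
S = rot("y", 30) @ rot("x", 30)
T = rot("x", 30) @ rot("y", 30)
U = rot("z", 45) @ rot("y", 20) @ rot("x", 10)

a = np.linalg.solve(np.stack([R[0], S[0], T[0]], axis=1), U[0])
b = np.linalg.solve(np.stack([R[1], S[1], T[1]], axis=1), U[1])

rigid = rotation_constraints(a, b, R, S, T)            # expected near (1, 1, 0)
stretched = rotation_constraints(a, 1.5 * b, R, S, T)  # y-row stretched: not a rotation
```

Scaling the y-coefficients by 1.5 still yields a valid linear combination of the model views, but the second norm becomes 1.5, so the test rejects it as a pure rotation.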

It should be noted that in some cases the explicit computation of the rotation matrices 
will not be necessary. First, if the set of allowable object transformations includes the 
entire set of linear 3-D transformations (including non-rigid stretch and shear), then 
no additional test of the coefficients is required. Second, if the transformations are
constrained to be rigid, but the test of the coefficients is not performed, then the penalty
may be some "false positive" misidentifications. If the image of one object happens to
be identical to the projection of a (non-rigid) linear transformation applied to another
object, then the two will be confusable. If the objects contain a sufficient number of
points (five or more), the likelihood of such an ambiguity becomes negligible. Finally, it is
worth noting that it is also possible to determine the coefficients of the constraint equations
above without computing the rotation matrices, by using a number of additional views
(see also section 1.3.5).

Regarding the independence condition mentioned above, for many triplets of rotation
matrices R, S and T both $\{\mathbf{r}_1, \mathbf{s}_1, \mathbf{t}_1\}$ and
$\{\mathbf{r}_2, \mathbf{s}_2, \mathbf{t}_2\}$ will in fact be linearly independent.
It will therefore be possible to select a non-degenerate triplet of views ($P_1$, $P_2$ and
$P_3$), in terms of which intermediate views are expressible as linear combinations. Note,
however, that in the special case that R is the identity matrix, S is a pure rotation about
the X-axis, and T about the Y-axis, the independence condition does not hold.

1.3.4 Rigid Transformations and Scaling in 3-D Space 

Rotation, translation and scaling in 3-D space can be represented as linear transformations
in 4-D space using homogeneous coordinates. Therefore, an image of a rigid object
can be expressed as the linear combination of four model images. In fact, only three
different snapshots of the object are required; the fourth view can be derived from them.

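Let $O$ be a set of object points. Let $P_1$, $P_2$ and $P_3$ be three images of $O$,
obtained by applying the $3 \times 3$ rotation matrices R, S and T to $O$, respectively.
Let $\hat{P}$ be a fourth image of the same object obtained by applying a $3 \times 3$
rotation matrix U to $O$, scaling by a scale factor $s$, and translating by a vector
$(t_x, t_y)$. Let $\mathbf{r}_1$, $\mathbf{s}_1$, $\mathbf{t}_1$ and $\mathbf{u}_1$ be again
the first row vectors of R, S, T and U, and $\mathbf{r}_2$, $\mathbf{s}_2$, $\mathbf{t}_2$
and $\mathbf{u}_2$ the second row vectors of R, S, T and U, respectively. For any point
$\mathbf{p} \in O$, its positions in the four images are given by:

$$\mathbf{p}_1 = (x_1, y_1) = (\mathbf{r}_1\mathbf{p},\; \mathbf{r}_2\mathbf{p})$$
$$\mathbf{p}_2 = (x_2, y_2) = (\mathbf{s}_1\mathbf{p},\; \mathbf{s}_2\mathbf{p})$$
$$\mathbf{p}_3 = (x_3, y_3) = (\mathbf{t}_1\mathbf{p},\; \mathbf{t}_2\mathbf{p})$$
$$\hat{\mathbf{p}} = (\hat{x}, \hat{y}) = (s\,\mathbf{u}_1\mathbf{p} + t_x,\; s\,\mathbf{u}_2\mathbf{p} + t_y)$$

Claim: If both sets $\{\mathbf{r}_1, \mathbf{s}_1, \mathbf{t}_1\}$ and
$\{\mathbf{r}_2, \mathbf{s}_2, \mathbf{t}_2\}$ are linearly independent, then there
exist scalars $a_1, a_2, a_3, a_4$ and $b_1, b_2, b_3, b_4$ such that for every point
$\mathbf{p} \in O$ it holds that:

$$\hat{x} = a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4$$
$$\hat{y} = b_1 y_1 + b_2 y_2 + b_3 y_3 + b_4$$

with the coefficients satisfying the two constraints:

$$\| a_1\mathbf{r}_1 + a_2\mathbf{s}_1 + a_3\mathbf{t}_1 \| = \| b_1\mathbf{r}_2 + b_2\mathbf{s}_2 + b_3\mathbf{t}_2 \|$$
$$(a_1\mathbf{r}_1 + a_2\mathbf{s}_1 + a_3\mathbf{t}_1) \cdot (b_1\mathbf{r}_2 + b_2\mathbf{s}_2 + b_3\mathbf{t}_2) = 0$$

Proof: $\{\mathbf{r}_1, \mathbf{s}_1, \mathbf{t}_1\}$ are linearly independent. Therefore,
they span $\mathbb{R}^3$, and there exist scalars $c_1$, $c_2$ and $c_3$ such that:

$$\mathbf{u}_1 = c_1\mathbf{r}_1 + c_2\mathbf{s}_1 + c_3\mathbf{t}_1$$

Since:

$$\hat{x} = s(\mathbf{u}_1\mathbf{p}) + t_x$$

Then:

$$\hat{x} = s c_1\mathbf{r}_1\mathbf{p} + s c_2\mathbf{s}_1\mathbf{p} + s c_3\mathbf{t}_1\mathbf{p} + t_x$$

Let:

$$a_1 = s c_1, \qquad a_2 = s c_2, \qquad a_3 = s c_3, \qquad a_4 = t_x$$

We obtain that:

$$\hat{x} = a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4$$

In a similar way we obtain that:

$$\hat{y} = b_1 y_1 + b_2 y_2 + b_3 y_3 + b_4$$

U is a rotation matrix, therefore:

$$\|\mathbf{u}_1\| = 1$$
$$\|\mathbf{u}_2\| = 1$$
$$\mathbf{u}_1 \cdot \mathbf{u}_2 = 0$$

It follows that:

$$\| s\mathbf{u}_1 \| = \| s\mathbf{u}_2 \|$$
$$(s\mathbf{u}_1) \cdot (s\mathbf{u}_2) = 0$$

And the constraints are obtained directly by substituting the appropriate linear
combinations for $s\mathbf{u}_1$ and $s\mathbf{u}_2$.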
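
In practice the coefficients would be estimated from image coordinates rather than from the row vectors. The sketch below recovers them by least squares, which is an assumed choice for the illustration and not a procedure prescribed here; the helper `rot`, the angles, and the scale and translation values are likewise hypothetical.

```python
import numpy as np

def rot(axis, deg):
    """Rotation matrix about a coordinate axis ('x', 'y' or 'z') by `deg` degrees."""
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    i, j = {"x": (1, 2), "y": (2, 0), "z": (0, 1)}[axis]
    M = np.eye(3)
    M[i, i], M[i, j], M[j, i], M[j, j] = c, -s, s, c
    return M

rng = np.random.default_rng(2)
points = rng.normal(size=(3, 12))                  # hypothetical object points
R, S, T = np.eye(3), rot("y", 30) @ rot("x", 30), rot("x", 30) @ rot("y", 30)
U = rot("z", -45) @ rot("y", 10) @ rot("x", 20)
scale, t_x = 1.2, 25.0                             # assumed scale and x-translation

x1, x2, x3 = (R @ points)[0], (S @ points)[0], (T @ points)[0]
x_new = scale * (U @ points)[0] + t_x

# The column of ones plays the role of the constant fourth "image".
basis = np.stack([x1, x2, x3, np.ones_like(x1)], axis=1)
coeffs, *_ = np.linalg.lstsq(basis, x_new, rcond=None)
residual = np.max(np.abs(basis @ coeffs - x_new))

# Coefficients predicted by the proof: a_i = s * c_i and a_4 = t_x.
c = np.linalg.solve(np.stack([R[0], S[0], T[0]], axis=1), U[0])
expected = np.append(scale * c, t_x)
```

The fitted coefficients should agree with `expected`, which shows that scaling only rescales $a_1, a_2, a_3$ while translation appears as the constant term.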

1.3.5 Using Two Views Only 

In the scheme described above, any image of a given object (within a certain range of 
rotations) is expressed as the linear combination of three fixed views of the object. For 
general linear transformations, it is also possible to use instead just two views of the 
object. (This observation was made independently by T. Poggio and R. Basri.) 

Let $O$ be again a rigid object (a collection of 3-D points). $P_1$ is a 2-D image of $O$,
and $P_2$ the image of $O$ following a rotation by R (a $3 \times 3$ matrix). We will denote
by $\mathbf{r}_1$, $\mathbf{r}_2$, $\mathbf{r}_3$ the three rows of R, and by $\mathbf{e}_1$,
$\mathbf{e}_2$, $\mathbf{e}_3$ the three rows of the identity matrix. For a given 3-D point
$\mathbf{p}$ in $O$, its coordinates $(x_1, y_1)$ in the first image view are
$x_1 = \mathbf{e}_1\mathbf{p}$, $y_1 = \mathbf{e}_2\mathbf{p}$. Its coordinates $(x_2, y_2)$
in the second view are given by: $x_2 = \mathbf{r}_1\mathbf{p}$, $y_2 = \mathbf{r}_2\mathbf{p}$.

Consider now any other view obtained by applying another $3 \times 3$ matrix U to the
points of $O$. The coordinates $(\hat{x}, \hat{y})$ of $\mathbf{p}$ in this new view will be:

$$\hat{x} = \mathbf{u}_1\mathbf{p}, \qquad \hat{y} = \mathbf{u}_2\mathbf{p}$$

(where $\mathbf{u}_1$, $\mathbf{u}_2$ are the first and second rows of U, respectively).
Assuming that $\mathbf{e}_1$, $\mathbf{e}_2$ and $\mathbf{r}_1$ span $\mathbb{R}^3$ (see
below), then:

$$\mathbf{u}_1 = a_1\mathbf{e}_1 + a_2\mathbf{e}_2 + a_3\mathbf{r}_1$$

for some scalars $a_1, a_2, a_3$. Therefore:

$$\hat{x} = \mathbf{u}_1\mathbf{p} = (a_1\mathbf{e}_1 + a_2\mathbf{e}_2 + a_3\mathbf{r}_1)\mathbf{p} = a_1 x_1 + a_2 y_1 + a_3 x_2$$

This equality holds for every point $\mathbf{p}$ in $O$. Let $\mathbf{x}_1$ be the vector of
all the x-coordinates of the points in the first view, $\mathbf{x}_2$ in the second,
$\hat{\mathbf{x}}$ in the third, and $\mathbf{y}_1$ the vector of y-coordinates in the first
view. Then:

$$\hat{\mathbf{x}} = a_1\mathbf{x}_1 + a_2\mathbf{y}_1 + a_3\mathbf{x}_2$$

Here $\mathbf{x}_1$, $\mathbf{y}_1$ and $\mathbf{x}_2$ are used as a basis for all of the
views. For any other image of the same object, its vector $\hat{\mathbf{x}}$ of x-coordinates
is the linear combination of these basis vectors.

Similarly, for the y-coordinates:

$$\hat{\mathbf{y}} = b_1\mathbf{x}_1 + b_2\mathbf{y}_1 + b_3\mathbf{x}_2$$

The vector $\hat{\mathbf{y}}$ of y-coordinates in the new image is therefore also the linear
combination of the same three basis vectors. In this version the basis vectors are the same
for the x- and y-coordinates, and they are obtained from two rather than three views. One can
view the situation as follows. Within an n-dimensional space, the vectors $\mathbf{x}_1$,
$\mathbf{y}_1$, $\mathbf{x}_2$ span a 3-dimensional subspace. For all the images of the object
in question, the vectors of both the x- and y-coordinates must reside within this
3-dimensional subspace.

Instead of using $(\mathbf{e}_1, \mathbf{e}_2, \mathbf{r}_1)$ as the basis for $\mathbb{R}^3$
we could also use $(\mathbf{e}_1, \mathbf{e}_2, \mathbf{r}_2)$. One of these bases spans
$\mathbb{R}^3$, unless the rotation R is a pure rotation around the line of sight.
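
The subspace view suggests a simple membership test, sketched below in Python. The random points, the 25° rotation and the use of a least-squares fit are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
points = rng.normal(size=(3, 10))            # hypothetical object points
theta = np.radians(25.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])   # rotation about the Y-axis
U = rng.normal(size=(3, 3))                  # arbitrary linear transformation

x1, y1 = points[0], points[1]                # first view (identity transformation)
x2 = (R @ points)[0]                         # x-coordinates of the second view
x_new, y_new = (U @ points)[:2]              # the view to be explained

# Basis vectors x1, y1, x2 span a 3-dimensional subspace of R^n (here n = 10).
basis = np.stack([x1, y1, x2], axis=1)
a, *_ = np.linalg.lstsq(basis, x_new, rcond=None)
b, *_ = np.linalg.lstsq(basis, y_new, rcond=None)

x_error = np.max(np.abs(basis @ a - x_new))
y_error = np.max(np.abs(basis @ b - y_new))

# A view of a different object should, in general, leave a large residual.
other = rng.normal(size=10)
c, *_ = np.linalg.lstsq(basis, other, rcond=None)
other_error = np.max(np.abs(basis @ c - other))
```

Both coordinate vectors of the new view fall inside the subspace, while an unrelated coordinate vector generally does not.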

The use of two views described above is applicable to general linear transformations
of the object, and, without additional constraints, it is impossible to distinguish between
rigid and linear but not rigid transformations of the object. To impose rigidity (with
possible scaling) the coefficients $(a_1, a_2, a_3, b_1, b_2, b_3)$ must meet two simple
constraints. Since U is now a rotation matrix (with possible scaling),

$$\mathbf{u}_1 \cdot \mathbf{u}_2 = 0$$
$$\|\mathbf{u}_1\| = \|\mathbf{u}_2\|$$

In terms of the coefficients $a_i$, $b_i$, $\mathbf{u}_1 \cdot \mathbf{u}_2 = 0$ implies:

$$a_1 b_1 + a_2 b_2 + a_3 b_3 + (a_1 b_3 + a_3 b_1)\, r_{11} + (a_2 b_3 + a_3 b_2)\, r_{12} = 0$$

The second constraint implies:

$$a_1^2 + a_2^2 + a_3^2 - b_1^2 - b_2^2 - b_3^2 = 2(b_1 b_3 - a_1 a_3)\, r_{11} + 2(b_2 b_3 - a_2 a_3)\, r_{12}$$

A third view can therefore be used to recover, using two linear equations, the values
of $r_{11}$ and $r_{12}$. ($r_{11}$ and $r_{12}$ can in fact be determined to within a scale
factor from the first two views, so only one additional equation is required.) The full
scheme for rigid objects is then the following. Given an image, determine whether the vectors
$\hat{\mathbf{x}}$, $\hat{\mathbf{y}}$ are linear combinations of $\mathbf{x}_1$,
$\mathbf{y}_1$ and $\mathbf{x}_2$. Only two views are required for this stage. Using
the values of $r_{11}$ and $r_{12}$, test whether the coefficients $a_i$, $b_i$
($i = 1, 2, 3$) satisfy the two constraints above.
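
The two constraints translate directly into code. The following Python sketch evaluates them for a rigid view with scaling and for a stretched, non-rigid view; the function name `rigidity_residuals`, the helper `rot` and all numerical values are assumptions made for the example.

```python
import numpy as np

def rot(axis, deg):
    """Rotation matrix about a coordinate axis ('x', 'y' or 'z') by `deg` degrees."""
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    i, j = {"x": (1, 2), "y": (2, 0), "z": (0, 1)}[axis]
    M = np.eye(3)
    M[i, i], M[i, j], M[j, i], M[j, j] = c, -s, s, c
    return M

def rigidity_residuals(a, b, r11, r12):
    """Residuals of the orthogonality and equal-norm constraints;
    both vanish when U is a rotation with possible scaling."""
    orthogonality = (a @ b + (a[0] * b[2] + a[2] * b[0]) * r11
                     + (a[1] * b[2] + a[2] * b[1]) * r12)
    equal_norm = (a @ a - b @ b - 2 * (b[0] * b[2] - a[0] * a[2]) * r11
                  - 2 * (b[1] * b[2] - a[1] * a[2]) * r12)
    return orthogonality, equal_norm

R = rot("y", 25) @ rot("x", 20)                        # assumed second model view
U = 1.3 * rot("z", 40) @ rot("y", 15) @ rot("x", -10)  # rotation with scaling

# Express u1 and u2 in the basis (e1, e2, r1).
basis = np.stack([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], R[0]], axis=1)
a = np.linalg.solve(basis, U[0])
b = np.linalg.solve(basis, U[1])

rigid = rigidity_residuals(a, b, R[0, 0], R[0, 1])
stretched = rigidity_residuals(a, 2.0 * b, R[0, 0], R[0, 1])  # y stretched by 2
```

For the rigid view both residuals are zero up to rounding. Stretching the y-coordinates keeps the rows orthogonal but violates the equal-norm constraint.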

It is of interest to compare this use of two views to structure-from-motion (SFM)
techniques for recovering 3-D structure from orthographic projections. It is well known
that three distinct views are required; two are insufficient [Ullman 1979]. Given only two
views and an infinitesimal rotation (the velocity field), the 3-D structure can be recovered
to within depth-scaling [Ullman 1983]. It is also straightforward to establish that if the
two views are separated by a general affine transformation of the 3-D object (rather 
than a rigid one), then the structure of the object can be recovered to within an affine 
transformation. 

Our use of two views above for the purpose of recognition is thus related to known 
results regarding the recovery of structure from motion. Two views are sufficient to 
determine the object's structure to within an affine transformation, and three are required 
to recover the full 3-D structure of a rigidly moving object. It can also be observed that 
an extension of the scheme above can be used to recover structure from motion. It was
shown how the scheme can be used to recover $r_{11}$ and $r_{12}$; $r_{21}$ and $r_{22}$ can
be recovered in a similar manner. Consequently, it becomes possible to recover 3-D structure
and motion in space based on three orthographic views, using linear equations.

1.3.6 Summary 

In this section we have shown that an object with sharp contours, undergoing rigid 
transformations and scaling in 3-D space followed by an orthographic projection, can be 
expressed as the linear combination of four images of the same object. In this scheme, 
the model of a 3-D object consists of a number of 2-D pictures of it. The pictures are in 
correspondence, in the sense that it is known which are the corresponding points in the 
different pictures. Two images are sufficient to represent general linear transformations 
of the object. Three images are required to represent rotations in 3-D space, and one 
additional image is required to represent translations. The scaling does not require any 
additional image, since it is represented by a scaling of the coefficients. As mentioned 
above, the fourth picture can be generated internally, therefore only three different snap- 
shots of the object are required. 

The linear combination scheme assumes that the same object points are visible in 
the different views. When the views are sufficiently different, this will no longer hold, 
due to self-occlusion. To represent an object from all possible viewing directions (e.g. 
both "front" and "back"), a number of different models of this type will be required. 
This notion is similar to the use of different object aspects suggested by Koenderink &
Van Doorn [1979]. (Other aspects of occlusion are examined in the final discussion and
Appendix D.) 

The linear combination scheme described above was implemented and applied first 
to artificially created images. Figure 2 shows examples of object models and their linear 
combinations. The figure shows how 3-D similarity transformations can be represented 
by the linear combinations of four images. 
Figure 2: (a) Three model pictures of a cube. The second picture was obtained by rotating
the cube by 30° around the X-axis, then by 30° around the Y-axis. The third picture was
obtained by rotating the cube by 30° around the Y-axis, then by 30° around the X-axis. (b)
Three model pictures of a pyramid taken with the same transformations as the pictures in
(a). (c) Two linear combinations of the cube model. The left picture was obtained using the
following parameters: the x-coefficients are (0.343, -2.618, 2.989, 0), and the y-coefficients
are (0.630, -2.533, 2.658, 0), which correspond to a rotation of the cube by 10°, 20° and 45°
around the X-, Y- and Z-axes respectively. The right picture was obtained using the following
parameters: x-coefficients (0.455, 3.392, -3.241, 0.25), y-coefficients (0.542, 3.753, -3.343,
-0.15). These coefficients correspond to a rotation of the cube by 20°, 10° and -45° around
the X-, Y- and Z-axes respectively, followed by a scaling of factor 1.2, and a translation of
(25, -15) pixels. (d) Two linear combinations of the pyramid model taken with the same
parameters as the pictures in (c).
1.4 Objects with Smooth Boundaries 

The case of objects with smooth boundaries is identical to the case of objects with sharp 
edges as long as we deal with translation, scaling and image rotation. The difference 
arises when the object rotates in 3-D space. This case is discussed in [Basri & Ullman, 
1988], where we have suggested a method for predicting the appearance of such objects 
following 3-D rotations. This method, called "the curvature method", is summarized 
briefly below. 

A model is represented by a set of 2-D contours. Each point $\mathbf{p} = (x, y)$ along the
contours is labeled with its depth value $z$, and a curvature value $r$. The curvature value
is the length of a curvature vector at $\mathbf{p}$, $r = \|(r_x, r_y)\|$. ($r_x$ is the
surface's radius of curvature at $\mathbf{p}$ in a planar section in the X direction, $r_y$
in the Y direction.) This vector is normal to the contour at $\mathbf{p}$. Let $V_\phi$ be an
axis lying in the image plane and forming an angle $\phi$ with the positive X direction, and
$\mathbf{r}_\phi$ be a vector of length $r_\phi = r_y\cos\phi - r_x\sin\phi$ and perpendicular
to $V_\phi$. When the object is rotated around $V_\phi$ we approximate the new position of the
point $\mathbf{p}$ in the image by:

$$\mathbf{p}' = R(\mathbf{p} - \mathbf{r}_\phi) + \mathbf{r}_\phi \qquad (1)$$

where R is the rotation matrix. The equation has the following meaning. When viewed
in a cross section perpendicular to the rotation axis $V_\phi$, the surface at $\mathbf{p}$
can be approximated by a circular arc with radius $r_\phi$ and center at
$\mathbf{p} - \mathbf{r}_\phi$. The new rim point $\mathbf{p}'$ is obtained by first applying
R to this center of curvature ($\mathbf{p} - \mathbf{r}_\phi$), then adding the radius
of curvature $\mathbf{r}_\phi$. This expression is precise for circular arcs, and gives a good
approximation for other surfaces provided that the angle of rotation is not too large (see
[Basri & Ullman 1988] for details). The depth and the curvature values were estimated in
[Basri & Ullman 1988] using three pictures of the object, and the results were improved using
five pictures. In this section we show how the curvature method can also be replaced by
linear combinations of a small number of pictures. In particular, we use three images to 
represent rotations around the vertical axis, and five images for general rotations in 3-D 
space. 
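
The statement that equation (1) is exact for circular arcs can be illustrated with a short Python sketch. For brevity it is restricted, as an assumption, to a horizontal cross section of an object rotating about the vertical axis, where the curvature vector is horizontal. The circle parameters and the function name `curvature_method` are hypothetical.

```python
import numpy as np

def curvature_method(p, r, R):
    """Equation (1): rotate the center of curvature (p - r), then add r back."""
    return R @ (p - r) + r

# Hypothetical cross section in the (x, z) plane: a circle of radius rho.
center = np.array([0.7, -0.4])
rho = 2.0
theta = np.radians(35.0)
R = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])   # x' = x cos(theta) + z sin(theta)

# Viewed along Z, the right-hand rim point is the extreme point of the circle in x.
p = center + np.array([rho, 0.0])
r = np.array([rho, 0.0])                          # horizontal curvature vector

predicted_x = curvature_method(p, r, R)[0]        # only x is observed in the image

# Ground truth: rotate the circle's center, then take the new extreme point in x.
actual_x = (R @ center)[0] + rho

# Treating the rim as fixed on the object (valid for sharp edges only) is wrong here.
fixed_rim_x = (R @ p)[0]
fixed_rim_error = abs(fixed_rim_x - actual_x)     # equals rho * (1 - cos(theta))
```

The prediction matches the true silhouette position exactly, whereas a rim assumed to be fixed on the surface is off by $\rho(1 - \cos\theta)$.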

1.4.1 3-D Rotation Around the Vertical Axis 

When an object rotates around the vertical (Y) axis by an angle $\theta$, $\mathbf{r}_\phi$
in equation (1) above becomes $\mathbf{r}_x$, a horizontal vector of length $r_x$. Therefore,
the new position of a point $\mathbf{p} = (x, y)$ is given by $\mathbf{p}' = (x', y')$ where:

$$x' = (x - r_x)\cos\theta + z\sin\theta + r_x = x\cos\theta + z\sin\theta + r_x(1 - \cos\theta)$$
$$y' = y$$

This expression gives the new coordinates $(x', y')$ in terms of the original coordinates
$(x, y)$, the rotation angle $\theta$, the local depth $z$ and the radius of curvature $r_x$.
Next we show that the new image can be expressed instead as the linear combination of three
2-D images.

Let $P_1$, $P_2$ and $P_3$ be three images of an object $O$ rotating around the vertical (Y)
axis. $P_2$ is obtained from $P_1$ by a rotation by an angle $\alpha$, and $P_3$ by a rotation
by an angle $\beta$ ($\alpha \neq \beta$; $\alpha, \beta \neq k\pi$). Let $\hat{P}$ be another
image of the same object obtained from $P_1$ by a rotation by an angle $\theta$ around the
vertical axis. We assume that the curvature scheme gives a sufficiently close approximation to
the images. Under this assumption, the positions of a point $\mathbf{p} = (x, y, z) \in O$ can
be expressed in the following manner:

$$\mathbf{p}_1 = (x_1, y_1) = (x,\; y) \in P_1$$
$$\mathbf{p}_2 = (x_2, y_2) = (x\cos\alpha + z\sin\alpha + r_x(1 - \cos\alpha),\; y) \in P_2$$
$$\mathbf{p}_3 = (x_3, y_3) = (x\cos\beta + z\sin\beta + r_x(1 - \cos\beta),\; y) \in P_3$$
$$\hat{\mathbf{p}} = (\hat{x}, \hat{y}) = (x\cos\theta + z\sin\theta + r_x(1 - \cos\theta),\; y) \in \hat{P}$$

Claim: $\hat{P}$ is a linear combination of $P_1$, $P_2$, $P_3$. That is, there exist scalars
a, b and c such that for every four corresponding points $\mathbf{p}_1$, $\mathbf{p}_2$,
$\mathbf{p}_3$, $\hat{\mathbf{p}}$:

$$\hat{x} = a x_1 + b x_2 + c x_3$$

with:

$$a + b + c = 1$$

and:

$$a^2 + b^2 + c^2 + 2ab\cos\alpha + 2ac\cos\beta + 2bc\cos(\beta - \alpha) = 1$$

Proof: We construct a, b and c explicitly. Let:

$$a = \frac{\sin(\alpha - \theta) - \sin(\beta - \theta) - \sin(\alpha - \beta)}{\sin\alpha - \sin\beta - \sin(\alpha - \beta)}$$

$$b = \frac{-\sin\beta + \sin\theta + \sin(\beta - \theta)}{\sin\alpha - \sin\beta - \sin(\alpha - \beta)}$$

$$c = \frac{\sin\alpha - \sin\theta - \sin(\alpha - \theta)}{\sin\alpha - \sin\beta - \sin(\alpha - \beta)}$$

($\alpha \neq \beta$ and $\alpha, \beta \neq k\pi$ implies that
$\sin\alpha - \sin\beta - \sin(\alpha - \beta) \neq 0$.) It follows that:

$$a x_1 + b x_2 + c x_3 = \frac{\sin(\alpha - \theta) - \sin(\beta - \theta) - \sin(\alpha - \beta)}{\sin\alpha - \sin\beta - \sin(\alpha - \beta)}\, x$$
$$+ \frac{-\sin\beta + \sin\theta + \sin(\beta - \theta)}{\sin\alpha - \sin\beta - \sin(\alpha - \beta)}\, (x\cos\alpha + z\sin\alpha + r_x(1 - \cos\alpha))$$
$$+ \frac{\sin\alpha - \sin\theta - \sin(\alpha - \theta)}{\sin\alpha - \sin\beta - \sin(\alpha - \beta)}\, (x\cos\beta + z\sin\beta + r_x(1 - \cos\beta))$$
$$= x\cos\theta + z\sin\theta + r_x(1 - \cos\theta) = \hat{x}$$

Therefore, an image of an object rotating around the vertical axis and described accurately
by the curvature method is always a linear combination of three model images. In
addition, if we substitute the values above for a, b and c in the two functional constraints
we obtain that:

$$a + b + c = 1$$

$$a^2 + b^2 + c^2 + 2ab\cos\alpha + 2ac\cos\beta + 2bc\cos(\beta - \alpha) = 1$$
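
The explicit coefficients can be verified numerically. The Python sketch below is an illustration only: the contour data are random, the angles are arbitrary, and the helper name `rim_x` is introduced for the example.

```python
import numpy as np

def rim_x(x, z, r_x, angle):
    """x-coordinate predicted by the curvature method for rotation about the Y-axis."""
    return x * np.cos(angle) + z * np.sin(angle) + r_x * (1 - np.cos(angle))

rng = np.random.default_rng(4)
x, z, r_x = rng.normal(size=(3, 20))      # hypothetical contour points, depths, curvatures
alpha, beta, theta = np.radians([20.0, 50.0, 35.0])

x1 = rim_x(x, z, r_x, 0.0)                # three model views
x2 = rim_x(x, z, r_x, alpha)
x3 = rim_x(x, z, r_x, beta)
x_new = rim_x(x, z, r_x, theta)           # view to be explained

# Coefficients as constructed in the proof, with common denominator d.
d = np.sin(alpha) - np.sin(beta) - np.sin(alpha - beta)
a = (np.sin(alpha - theta) - np.sin(beta - theta) - np.sin(alpha - beta)) / d
b = (-np.sin(beta) + np.sin(theta) + np.sin(beta - theta)) / d
c = (np.sin(alpha) - np.sin(theta) - np.sin(alpha - theta)) / d

combination_error = np.max(np.abs(a * x1 + b * x2 + c * x3 - x_new))
sum_constraint = a + b + c
quadratic_constraint = (a**2 + b**2 + c**2 + 2 * a * b * np.cos(alpha)
                        + 2 * a * c * np.cos(beta) + 2 * b * c * np.cos(beta - alpha))
```

Both constraints evaluate to 1 and the combination reproduces the new view, independently of the depth and curvature values of the individual points.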

1.4.2 General Rotation in 3-D Space 

In this section we first derive an expression for the image deformation of an object with 
smooth boundaries under general 3-D rotation. We then use this expression to show that
the deformed image can be expressed as the linear combination of five images. 

Computing the transformed image. 

Using the curvature method we can predict the appearance of an object undergoing a
general rotation in 3-D space as follows. A rotation in 3-D space can be decomposed
into the following three successive rotations: a rotation around the Z-axis, a subsequent
rotation around the X-axis, and a final rotation around the Z-axis, by angles $\alpha$,
$\beta$ and $\gamma$ respectively. Since the Z-axis coincides with the line of sight, a
rotation around the Z-axis is simply an image rotation. Therefore, only the second rotation
deforms the object, and the curvature method must be applied to it. Suppose that the curvature
vector at a given point $\mathbf{p} = (x, y)$ before the first Z-rotation is $(r_x, r_y)$.
Following the rotation by $\alpha$ it becomes $r'_x = r_x\cos\alpha - r_y\sin\alpha$ and
$r'_y = r_x\sin\alpha + r_y\cos\alpha$. The second rotation is around the X-axis, and
therefore the appropriate $r_\phi$ to be used in eq. (1) becomes
$r'_y = r_x\sin\alpha + r_y\cos\alpha$. The complete rotation (all three rotations) therefore
takes a point $\mathbf{p} = (x, y)$ through the following sequence of transformations:

$$(x,\; y) \rightarrow (x\cos\alpha - y\sin\alpha,\;\; x\sin\alpha + y\cos\alpha) \rightarrow$$
$$(x\cos\alpha - y\sin\alpha,\;\; (x\sin\alpha + y\cos\alpha)\cos\beta - z\sin\beta + (r_x\sin\alpha + r_y\cos\alpha)(1 - \cos\beta)) \rightarrow$$
$$((x\cos\alpha - y\sin\alpha)\cos\gamma + ((x\sin\alpha + y\cos\alpha)\cos\beta - z\sin\beta + (r_x\sin\alpha + r_y\cos\alpha)(1 - \cos\beta))\sin\gamma,$$
$$-(x\cos\alpha - y\sin\alpha)\sin\gamma + ((x\sin\alpha + y\cos\alpha)\cos\beta - z\sin\beta + (r_x\sin\alpha + r_y\cos\alpha)(1 - \cos\beta))\cos\gamma)$$

(The first of these transformations is the first Z-rotation, the second is the deformation
caused by the X-rotation, and the third is the final Z-rotation.)

This is an explicit expression of the final coordinates of a point on the object's contour.
This can also be expressed more compactly as follows. Let $R = \{r_{ij}\}$ be a $3 \times 3$
rotation matrix. Let $\alpha$, $\beta$ and $\gamma$ be the angles of the Z-X-Z rotations
represented by R. We construct a new matrix $R' = \{r'_{ij}\}$ of size $2 \times 5$ as follows:

$$R' = \begin{pmatrix} r_{11} & r_{12} & r_{13} & \sin\alpha(1 - \cos\beta)\sin\gamma & \cos\alpha(1 - \cos\beta)\sin\gamma \\ r_{21} & r_{22} & r_{23} & \sin\alpha(1 - \cos\beta)\cos\gamma & \cos\alpha(1 - \cos\beta)\cos\gamma \end{pmatrix}$$

Let $\mathbf{p} = (x, y)$ be a contour point with depth $z$ and curvature vector
$(r_x, r_y)$, and let $\tilde{\mathbf{p}} = (x, y, z, r_x, r_y)$. Then, the new appearance of
$\mathbf{p}$ after a rotation R is applied to the object is described by:

$$\mathbf{p}' = R'\tilde{\mathbf{p}} \qquad (2)$$

This is true because eq. (2) is equivalent to eq. (1) in section 1.4 with the appropriate
values for $r_\phi$.
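
The equivalence between the matrix form and the three successive transformations can be checked with the Python sketch below. It is written under an explicit assumption about sign conventions: the final Z-rotation is taken as $(x, y) \rightarrow (x\cos\gamma + y\sin\gamma,\; -x\sin\gamma + y\cos\gamma)$, which is the convention under which the last two columns of $R'$ carry positive signs. The function names are hypothetical.

```python
import numpy as np

def curvature_matrix(alpha, beta, gamma):
    """2x5 matrix R' for Z-X-Z angles (alpha, beta, gamma).
    Assumed convention: the final Z-rotation maps (x, y) to
    (x cos g + y sin g, -x sin g + y cos g)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rz_first = np.array([[ca, -sa, 0.0], [sa, ca, 0.0], [0.0, 0.0, 1.0]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cb, -sb], [0.0, sb, cb]])
    Rz_last = np.array([[cg, sg, 0.0], [-sg, cg, 0.0], [0.0, 0.0, 1.0]])
    R = Rz_last @ Rx @ Rz_first
    extra = (1 - cb) * np.array([[sa * sg, ca * sg], [sa * cg, ca * cg]])
    return np.hstack([R[:2], extra])

def step_by_step(p_tilde, alpha, beta, gamma):
    """Apply the three transformations one after another to (x, y, z, r_x, r_y)."""
    x, y, z, r_x, r_y = p_tilde
    x1 = x * np.cos(alpha) - y * np.sin(alpha)          # first Z-rotation
    y1 = x * np.sin(alpha) + y * np.cos(alpha)
    r1 = r_x * np.sin(alpha) + r_y * np.cos(alpha)      # rotated curvature component
    y2 = y1 * np.cos(beta) - z * np.sin(beta) + r1 * (1 - np.cos(beta))  # X-rotation
    return np.array([x1 * np.cos(gamma) + y2 * np.sin(gamma),           # final Z-rotation
                     -x1 * np.sin(gamma) + y2 * np.cos(gamma)])

angles = np.radians([15.0, 40.0, -25.0])
p_tilde = np.array([0.3, -1.1, 0.8, 0.5, 0.2])          # hypothetical contour point

matrix_result = curvature_matrix(*angles) @ p_tilde
sequence_result = step_by_step(p_tilde, *angles)
```

With a different sign convention for the final rotation, the signs of the entries involving $\sin\gamma$ would change accordingly, but the $2 \times 5$ structure of $R'$ would be the same.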

Expressing the transformed image as a linear combination. 

Let $O$ be a set of points of an object rotating in 3-D space. Let $P_1, P_2, P_3, P_4$ and
$P_5$ be five images of $O$, obtained by applying rotation matrices $R_1, \ldots, R_5$ respectively.
$P$ is an image of the same object obtained by applying a rotation matrix $R$ to $O$. Let
$R'_1, \ldots, R'_5, R'$ be the corresponding $2 \times 5$ matrices representing the transformations applied
to the contour points according to the curvature method. Finally, let $r_1, \ldots, r_5, r$ denote
the first row vectors of $R'_1, \ldots, R'_5, R'$, and $s_1, \ldots, s_5, s$ the second row vectors of $R'_1, \ldots, R'_5, R'$
respectively. The positions of a point $p = (x, y) \in O$, with $\hat{p} = (x, y, z, r_x, r_y)$, in the six
pictures are then given by:

\[ p_i = (x_i, y_i) = (r_i\hat{p},\, s_i\hat{p}) \in P_i, \qquad 1 \le i \le 5 \]
\[ p = (x, y) = (r\hat{p},\, s\hat{p}) \in P \]

Claim: If both sets $\{r_1, \ldots, r_5\}$ and $\{s_1, \ldots, s_5\}$ are linearly independent vectors then
there exist scalars $a_1, \ldots, a_5$ and $b_1, \ldots, b_5$ such that for every point $p \in O$ it holds that:

\[ x = \sum_{i=1}^{5} a_i x_i \]
\[ y = \sum_{i=1}^{5} b_i y_i \]

Proof: $\{r_1, \ldots, r_5\}$ are linearly independent. Therefore, they span $\mathbb{R}^5$, and there exist
scalars $a_1, \ldots, a_5$ such that:

\[ r = \sum_{i=1}^{5} a_i r_i \]

Since (with $\hat{p}$ the five-component vector defined above):

\[ x = r\hat{p} \]

Then:

\[ x = \sum_{i=1}^{5} a_i r_i \hat{p} \]

That is:

\[ x = \sum_{i=1}^{5} a_i x_i \]

In a similar way we obtain that:

\[ y = \sum_{i=1}^{5} b_i y_i \]

In addition, for pure rotation, the coefficients of these linear combinations satisfy seven
functional constraints. These constraints, which are second degree polynomials, are given
in Appendix A.

Again, one may or may not actually test for these additional constraints. If the test
is omitted, the probability of a false-positive misidentification is slightly increased.

As in the case of sharp boundaries, it is possible to use mixed x- and y-coordinates to
reduce the number of basic views for general linear transformations (Section 1.3.5). For
example, one can use five basis vectors $(x_1, x_2, x_3, y_1, y_2)$ taken from three distinct views
as the basis for the x- and y-coordinates in all other views.

1.4.3 Rigid Transformation and Scaling in 3-D Space 

So far we have shown that an object with smooth boundaries, represented by the cur- 
vature scheme, and undergoing a rotation in 3-D space, can be represented as a linear 
combination of 2-D views. The method can be easily extended to handle translation by 
taking, as before, an additional image of the object. The linear combination scheme for 
objects with smooth bounding contours is thus a direct extension of the scheme in section 
1.3 for objects with sharp boundaries. In both cases, object views are expressed as the 
linear combination of a small number of pictures. The scheme for objects with sharp 
boundaries can be viewed as a special case of the more general one, when r, the radius 
of curvature, vanishes. In practice, we found that it is also possible to use the scheme for 
sharp boundaries, that uses a smaller number of views in each model, for general objects, 
provided that r is not too large (and at the price of increasing the number of models). 

1.4.4 Summary

In this section we have shown that an object with smooth boundaries undergoing rigid 
transformations and scaling in 3-D space followed by an orthographic projection, can be 
expressed (within the approximation of the curvature method) as the linear combination 
of six images of the object. Five images are used to represent rotations in 3-D space, 
and one additional image is required to represent translations. (In fact, although the 
coordinates are expressed in terms of five basis vectors, only three distinct views are 
needed for a general linear transformation.) The scaling does not require any additional 
image since it is represented by a scaling of the coefficients. This scheme was implemented 
and applied to images of 3-D objects. 

Figures 3 and 4 show the application of the LC (linear combination) method to complex
objects with smooth bounding contours. Since the rotation was about the vertical
axis, three 2-D views were used for each model. The figures show good agreement
between the actual image and the appropriate linear combination. Although the objects
are similar, they are easily discriminable by the LC method within the entire 60° rotation
range.

Finally, it is worth noting that the modeling of objects by linear combinations of 
stored pictures is not limited only to rigid objects. The method can also be used to 
deal with various types of non-rigid transformations, such as articulations and non-rigid 
stretching. For example, in the case of an articulated object, the object is composed of a 
number of rigid parts linked together by joints that constrain the relative movement of
the parts. We saw that the x- and y-coordinates of a rigid part are constrained to a 4-D 
subspace. Two rigid parts reside within an 8-D subspace, but, because of the constraints 
at the joints, they usually occupy a smaller subspace (e.g., 6-D for a planar joint). 

Figure 3: (a) Three model pictures of a VW car for rotations around the vertical axis. The
second and the third pictures were obtained from the first by rotations of ±30° around the Y-axis.
(b) Two linear combinations of the VW model. The x-coefficients are (0.556, 0.463, -0.018)
and (0.582, -0.065, 0.483), which correspond to a rotation of the first model picture by ±15°.
These are artificial images, created by linear combinations of the first three views, rather than
actual views. (c) Real images of a VW car. (d) Matching the linear combinations to the real
images. Each contour image is a linear combination superimposed on the actual image. The
agreement is good within the entire range of ±30°. (e) Matching the VW model to pictures of
the Saab car.

Figure 4: (a) Three model pictures of a Saab car taken with approximately the same
transformations as the VW model pictures. (b) Two linear combinations of the Saab model. The
x-coefficients are (0.601, 0.471, -0.072) and (0.754, -0.129, 0.375), which correspond to a
rotation of the first model picture by ±15°. (c) Real images of a Saab car. (d) Matching the linear
combinations to the real images. (e) Matching the Saab model to pictures of the VW car.

2 Determining the Alignment Coefficients

In the previous sections we have shown that the set of possible views of an object can
often be expressed as the linear combination of a small number of views. In this section
we examine the problem of determining the transformation between a model and a
viewed object. The model is given in this scheme as a set of $k$ corresponding 2-D images
$\{M_1, \ldots, M_k\}$. A viewed object $P$ is an instance of this model if there exists a set of
coefficients $\{a_1, \ldots, a_k\}$ (with a possible set of restrictions $F(a_1, \ldots, a_k) = 0$) such that:

\[ P = a_1 M_1 + \cdots + a_k M_k \qquad (3) \]

In practice we may not obtain a strict equality. We will attempt to minimize, therefore,
the difference between $P$ and $a_1 M_1 + \cdots + a_k M_k$. The problem we face is how to determine
the coefficients $\{a_1, \ldots, a_k\}$. In the following subsections we will discuss three alternative
methods for approaching this problem.

2.1 Minimal Alignment: Using a Small Number of Corresponding Features

The coefficients of the linear combination that align the model to the image can be
determined using a small number of features, identified in both the model and the image to be
recognized. This is similar to previous work in the framework of the alignment approach
[Fischler & Bolles 1981, Huttenlocher & Ullman 1987, Lowe 1985, Ullman 1986, 1989]. It
has been shown that three corresponding points or lines are usually sufficient to
determine the transformation that aligns a 3-D model to a 2-D image [Ullman 1986, 1989,
Huttenlocher & Ullman 1987, Shoham & Ullman 1988], assuming the object can undergo
only rigid transformations and uniform scaling. In previous methods, 3-D models of the
object were stored. The corresponding features (lines and points) were then used to
recover the 3-D transformation separating the viewed object from the stored model.

The coefficients of the linear combination required to align the model views with the 
image can be derived in principle, as in previous methods, by first recovering the 3-D 
transformations. They can also be derived directly, however, by simply solving a set of 
linear equations. This method requires k points to align a model of k pictures to a given 
image. Therefore, four points are required to determine the transformation for objects 
with sharp edges, and six points for objects with smooth boundaries. In this way we 
can deal with any transformation that can be approximated by linear combinations of 
pictures, without recovering the 3-D transformations explicitly. 

The coefficients of the linear combination are determined by solving the following
equations. We assume that a small number of corresponding points (the "alignment
points") have been identified in the image and the model. Let $X$ be the matrix of the
x-coordinates of the alignment points in the model. That is, $x_{ij}$ is the x-coordinate of
the $j$'th point in the $i$'th model picture. $p_x$ is the vector of x-coordinates of the alignment
points in the image, and $a$ is the vector of unknown alignment parameters. The linear
system to be solved is then $Xa = p_x$. The alignment parameters are given by $a = X^{-1} p_x$
if an exact solution exists. We may use an overdetermined system (by using additional
points), in which case $a = X^{+} p_x$ (where $X^{+}$ denotes the pseudo-inverse of $X$). The
matrix $X^{+}$ does not depend on the image and can be pre-computed for the model. The
recovery of the coefficients therefore requires only a multiplication of $p_x$ by a known
matrix. Similarly, we solve $Yb = p_y$ to extract the alignment parameters $b$ in the
y-direction from $Y$ (the matrix of y-coordinates in the model), and $p_y$ (the corresponding
y-coordinates in the image).
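
The recovery of the x-coefficients can be illustrated with a minimal sketch. It assumes that X is arranged with one row per alignment point and one column per model picture, and that X has full column rank, in which case the pseudo-inverse solution coincides with the solution of the normal equations. The function names and the data are hypothetical and serve only as an illustration; the paper does not specify an implementation.

```python
def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(v)] for row, v in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def alignment_coefficients(X, p):
    """Least-squares solution of X a = p via the normal equations (X^T X) a = X^T p.

    Assumption: X has one row per alignment point, one column per model picture.
    """
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xtp = [sum(row[i] * v for row, v in zip(X, p)) for i in range(k)]
    return solve(XtX, Xtp)


# Hypothetical data: 5 alignment points (rows) seen in 3 model pictures (columns).
X = [[1, 2, 0], [0, 1, 3], [2, 0, 1], [1, 1, 1], [3, 2, 5]]
true_a = [0.5, 0.3, 0.2]
# Image x-coordinates synthesised as an exact linear combination of the model.
p_x = [sum(c * a for c, a in zip(row, true_a)) for row in X]
a = alignment_coefficients(X, p_x)
```

The same routine would be applied to the y-coordinates to obtain b. In a real system the factorisation of X would be computed once per model, since it does not depend on the image.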

It is also worth noting that the computation can proceed in a similar fashion on the 
basis of correspondence between straight line segments rather than points. In this case, 
due to the "aperture problem" [Marr & Ullman 1981], only the perpendicular component 
(to the contour) of the displacement can be measured. This component can be used, 
however, in the equations above. In this case each contour segment contributes a single 
equation (as opposed to a point correspondence, that gives two equations). 
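
One possible way to write the single equation contributed by a segment is sketched below. This is an assumption about how the perpendicular component would enter the equations, not a formula given in the text. Let the $j$'th model feature correspond to an image segment with unit normal $n_j = (n_{jx}, n_{jy})$, let $d_j$ be the signed distance of that segment's line from the origin, and let $x_{ij}$, $y_{ij}$ be the coordinates of the feature in the $i$'th model picture. Requiring the predicted position to lie on the line gives:

```latex
n_{jx} \sum_{i=1}^{k} a_i x_{ij} \;+\; n_{jy} \sum_{i=1}^{k} b_i y_{ij} \;=\; d_j
```

Under this reading the equation is still linear in the unknowns, but it couples the x- and y-coefficients, so the two systems would have to be solved jointly for the $2k$ unknowns $(a_1, \ldots, a_k, b_1, \ldots, b_k)$.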

One question that may arise in this context is whether the visual system can be 
expected to extract reliably a sufficient number of alignment features. Two comments 
are noteworthy. First, this difficulty is not specific to the linear combination scheme, 
but applies to other alignment schemes as well. Second, although the task is not simple, 
the phenomenon of apparent motion suggests that mechanisms for establishing feature 
correspondence do in fact exist in the visual system. 

It is interesting to note in this regard that the correspondence established during 
apparent motion appears to provide sufficient information for the purpose of recognition 
by linear combinations. For example, when the car pictures in figure 5(a) are shown 
in apparent motion, the points marked in the left picture appear perceptually to move 
and match the corresponding points marked in the right picture. These points, with 
the perceptually established match, were used to align the model and images in figure 
5. That is, the coordinates of these points were used in the equations above to recover 
the alignment coefficients. The model contained six pictures of a Saab car in order to 
cover all rigid transformations for an object with smooth boundaries. As can be seen, 
a close agreement was obtained between the image and the transformed model. (The 
model contained only a subset of the contours, the ones that were clearly visible in all of 
the different pictures.) 

Figure 5: Aligning a model to images using corresponding features. (a) Two images of a Saab
car, and one of the six model pictures. (b) The corresponding points used to align the model
to the images. The correspondence was determined using apparent motion, as explained in the
text. (c) The transformed model. (d) The transformed model superimposed on the original
images.

2.2 Searching for the Coefficients

An alternative method to determine the best linear combination is by a search in the
space of possible coefficients. In this method we choose some initial values for the set
$\{a_1, \ldots, a_k\}$ of coefficients, then we apply a linear combination to the model using this set
of coefficients. We repeat this process using a different set of coefficients, and take the
coefficient values that produced the best match of the model to the image.

The most problematic aspect of this method is that the domain of coefficients might 
be large, therefore the search might be prohibitive. We can reduce the search space 
by first performing a rough alignment of the model to the image. The identification of 
general features in both the image and the model, such as a dominant orientation, the 
center of gravity, and a measurement of the overall size of the imaged object, can be used 
for compensating roughly for image rotation, translation and scaling. Assuming that 
this process compensates for these transformations up to a bounded error, and that the 
rotations in 3-D space covered by the model are also restricted, then we could restrict the 
search for the best coefficients to a limited domain. Moreover, the search can be guided 
by an optimization procedure. We can define an error measure (for instance, the area 
enclosed between the transformed model and the image) that must be minimized, and use 
minimization techniques such as gradient descent to make the search more efficient. The 
preliminary stage of rough alignment may help prevent such methods from reaching
a local minimum instead of the global one.
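
The guided search can be sketched as plain gradient descent. The sketch below is only illustrative and makes simplifying assumptions: it replaces the area-based error measure mentioned above with a sum of squared coordinate differences (which presupposes that the coordinates of the model and the image are listed in the same order), and it uses a fixed step size chosen by hand for this toy data. All names and values are hypothetical.

```python
def combine(model, coeffs):
    """Linear combination of model pictures; each picture is a flat coordinate list."""
    return [sum(a * pic[i] for a, pic in zip(coeffs, model)) for i in range(len(model[0]))]


def error(model, coeffs, image):
    """Sum of squared coordinate differences (a stand-in for the area-based measure)."""
    return sum((c - v) ** 2 for c, v in zip(combine(model, coeffs), image))


def search_coefficients(model, image, start, step=0.1, iterations=200):
    """Gradient descent on the error above, starting from a rough initial guess."""
    coeffs = list(start)
    for _ in range(iterations):
        residual = [c - v for c, v in zip(combine(model, coeffs), image)]
        # Gradient of the squared error with respect to each coefficient.
        grad = [2 * sum(r * m for r, m in zip(residual, pic)) for pic in model]
        coeffs = [a - step * g for a, g in zip(coeffs, grad)]
    return coeffs


# Hypothetical model of two pictures, four coordinates each.
model = [[1.0, 0.0, 1.0, 0.5], [0.0, 1.0, 0.5, 1.0]]
image = combine(model, [0.6, 0.4])
found = search_coefficients(model, image, start=[1.0, 0.0])
```

Because this error is quadratic in the coefficients it has a single minimum; the local-minimum concern raised in the text applies to error measures, such as the enclosed area, that are not of this simple form.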

2.3 Linear Mappings 

The linear combination scheme is based on the fact that a 3-D object can be modeled by 
the linear combination of a small number of pictures. That is, the set of possible views of 
an object is embedded in a linear space of a low dimensionality. We can use this property 
to construct a linear operator that maps each member of such a space to a predefined 
vector, which identifies the object. This method is different from the previous two in 
that we do not recover explicitly the coefficients $(a_1, \ldots, a_k)$ of the linear combination.
Instead, we assume that a full correspondence has been established between the viewed
object and the stored model. We then use a linear mapping to test whether the viewed
object is a linear combination of the model views. 

Suppose that a pattern $P$ is represented by a vector $p$ of its coordinates (e.g.,
$(x_1, y_1, x_2, y_2, \ldots, x_n, y_n)$). Let $P_1$ and $P_2$ be two different patterns representing the same
object. We can now construct a matrix $L$ that maps both $p_1$ and $p_2$ to the same output
vector $q$. That is, $Lp_1 = Lp_2 = q$. Any linear combination $ap_1 + bp_2$ will then be
mapped to the same output vector $q$, multiplied by the scalar $a + b$. We can choose, for
example, $q = p_1$, in which case any view of the object will be mapped by $L$ to a selected
"canonical view" of it.

We have seen above that different views of the same object can usually be expressed as
linear combinations $\sum a_i p_i$ of a small number of representative views, $P_i$. If the mapping
matrix $L$ is constructed in such a manner that $Lp_i = q$ for all the views $P_i$ in the same
model, then any combined view $p = \sum a_i p_i$ will be mapped by $L$ to the same $q$ (up to
a scale), since $Lp = (\sum a_i) q$.

$L$ can be constructed as follows. Let $\{p_1, \ldots, p_k\}$ be $k$ linearly independent vectors
representing the model pictures (we can assume that they are all linearly independent
since a picture that is not is obviously redundant). Let $\{p_{k+1}, \ldots, p_n\}$ be a set of vectors
such that $\{p_1, \ldots, p_n\}$ are all linearly independent. We define the following matrices:

\[ P = (p_1, \ldots, p_k, p_{k+1}, \ldots, p_n) \]
\[ Q = (q, \ldots, q, p_{k+1}, \ldots, p_n) \]

We require that:

\[ LP = Q \]

Therefore:

\[ L = QP^{-1} \]

Note that since $P$ is composed of $n$ linearly independent vectors, the inverse matrix $P^{-1}$
exists, therefore $L$ can always be constructed.

By this definition we obtain a matrix $L$ that maps any linear combination of the set of
vectors $\{p_1, \ldots, p_k\}$ to a scaled pattern $\alpha q$. Furthermore, it maps any vector orthogonal
to $\{p_1, \ldots, p_k\}$ to itself. Therefore, if $p$ is a linear combination of $\{p_1, \ldots, p_k\}$ with an
additional orthogonal noise component, it would be mapped by $L$ to $q$ combined with
the same amount of noise.

In constructing the matrix $L$, one may use more than just $k$ vectors $p_i$, particularly if
the input data is noisy. In this case a problem arises of estimating the best $k$-dimensional
linear subspace spanned by a larger collection of vectors. This problem is treated in
Appendix B.

In our implementation we have used $Lp_i = 0$ for all the view vectors $p_i$ of a given
object. The reason is that if a new view of the object $p$ is given by $\sum a_i p_i$ with $\sum a_i = 0$,
then $Lp = 0$. This means that the linear mapping $L$ may send a legal view to the zero
vector, and it is therefore convenient to choose the zero vector as the common output for
all the object's views. If it is desirable to obtain at the output level a canonical view of
the object such as $p_1$ rather than the zero vector, then one can use as the final output
the vector $p_1 - Lp$.

The decision regarding whether or not $p$ is a view of the object represented by $L$ can
be based on comparing $\|Lp\|$ with $\|p\|$. If $p$ is indeed a view of the object, then this
ratio will be small (exactly 0 in the noise-free condition). If the view is "pure noise" (in
the space orthogonal to the span of $(p_1, \ldots, p_k)$), then this ratio will be equal to 1.
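
The decision rule can be sketched as follows, under one explicit assumption: if the completing vectors are chosen orthogonal to the model views and every model view is mapped to zero, then applying L amounts to removing from p its component in the span of the model views. The sketch computes that residual with Gram-Schmidt orthonormalisation instead of forming a matrix inverse. The function names and data are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors):
    """Gram-Schmidt orthonormalisation of linearly independent view vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [a - c * b for a, b in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        basis.append([a / norm for a in w])
    return basis


def mismatch_ratio(views, p):
    """Return ||Lp|| / ||p|| for the L that sends every model view to zero and
    leaves vectors orthogonal to the model views unchanged (assumed construction)."""
    residual = list(p)
    for u in orthonormal_basis(views):
        c = dot(residual, u)
        residual = [a - c * b for a, b in zip(residual, u)]
    return math.sqrt(dot(residual, residual)) / math.sqrt(dot(p, p))


# Hypothetical model with two view vectors of four coordinates each.
views = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
legal_view = [0.3 * a + 0.7 * b for a, b in zip(views[0], views[1])]
pure_noise = [2.0, 0.0, -1.0, 0.0]  # orthogonal to both model views

ratio_view = mismatch_ratio(views, legal_view)
ratio_noise = mismatch_ratio(views, pure_noise)
```

A threshold on the ratio, between the two extremes of 0 and 1, would then separate views of the object from other patterns; the choice of threshold is not specified in the text.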

The general idea is somewhat similar to the associative mappings presented in [Kohonen,
Oja & Lehtio 1981]. However, in our scheme, unlike the one presented by Kohonen,
Oja & Lehtio [1981], we take advantage of the fact that intermediate views of 3-D objects
can be expressed as the linear combination of model views. Our scheme therefore uses
the coordinates of image contours, rather than the image intensity values.

Figure 6 shows the application of the linear mapping to two models of simple geo- 
metrical structures, a cube (a) and a pyramid (b). For each model we have constructed a 
matrix that maps any linear combination of the model pictures to the first picture of the 
model. The matrices were applied to images (c) and (e), and the results are presented in 
(d) and (f). 

2.4 The Use of Linear Receptive Fields 

Two of the three methods above are correspondence-based. They require the identifica- 
tion of corresponding features in the model and the image to be recognized to recover 
the coefficients or to apply the linear mapping. In this section we suggest a method that 
may be used (along with some other methods) to alleviate to some degree the problem 
of establishing a pointwise correspondence. 

The goal is to test whether a viewed pattern $P$ is a linear combination of patterns
in the model, without establishing a pointwise correspondence. To do this we use the
following idea. Suppose that, as before, an intermediate view $P$ is the linear combination
of two views $P_1$ and $P_2$ in the model, that is, $P = aP_1 + bP_2$. Let us now take an
arbitrary group of $l$ corresponding points in $P_1$, $P_2$ and $P$. Let $a_1, \ldots, a_l$ denote the $l$
points in pattern $P_1$, $b_1, \ldots, b_l$ in $P_2$ and $c_1, \ldots, c_l$ in $P$. Let us denote $A_x = \sum_{i=1}^{l} a_{ix}$
(i.e., the sum of the x-coordinates of all the points $a_1, \ldots, a_l$). Similarly $A_y = \sum_{i=1}^{l} a_{iy}$,
$B_x = \sum_{i=1}^{l} b_{ix}$, $B_y = \sum_{i=1}^{l} b_{iy}$, $C_x = \sum_{i=1}^{l} c_{ix}$ and $C_y = \sum_{i=1}^{l} c_{iy}$. From the linear
combination, $P = aP_1 + bP_2$, it also follows that:

\[ C_x = aA_x + bB_x \]
\[ C_y = aA_y + bB_y \]

(We have seen above examples in which different coefficients were used for the x- and
y-coordinates. Here we have assumed for simplicity that they are identical.) This
demonstrates that we can use corresponding subsets of points without resolving the individual
pointwise correspondence.
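
A small sketch of this observation: the coordinate sums over a subset do not depend on the order of the points, so the coefficients can be recovered from the two sum equations even after the point-to-point ordering in the image has been discarded. The example assumes, as in the text, identical coefficients for x and y; the point sets and function names are hypothetical.

```python
import random


def coordinate_sums(points):
    """Sum of x- and of y-coordinates over an unordered subset of points."""
    return sum(x for x, _ in points), sum(y for _, y in points)


def coefficients_from_sums(subset1, subset2, subset_image):
    """Solve C = a*A + b*B (two equations, two unknowns) by Cramer's rule."""
    ax, ay = coordinate_sums(subset1)
    bx, by = coordinate_sums(subset2)
    cx, cy = coordinate_sums(subset_image)
    det = ax * by - bx * ay
    return (cx * by - bx * cy) / det, (ax * cy - cx * ay) / det


# Hypothetical corresponding subsets in the two model pictures.
p1 = [(1.0, 2.0), (3.0, 1.0), (0.0, 4.0)]
p2 = [(2.0, 2.0), (1.0, 0.0), (5.0, 1.0)]
# The image subset is the combination 0.7*P1 + 0.3*P2 of corresponding points.
image = [(0.7 * x1 + 0.3 * x2, 0.7 * y1 + 0.3 * y2)
         for (x1, y1), (x2, y2) in zip(p1, p2)]
random.shuffle(image)  # discard the point-to-point ordering

a, b = coefficients_from_sums(p1, p2, image)
```

With more than two model pictures, or with separate x- and y-coefficients, a single subset no longer gives enough equations and several subsets would be needed.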

Figure 6: (a) Applying cube and pyramid matrices to the cubes of fig. 2. (b) Applying pyramid
and cube matrices to the pyramids of fig. 2. Left column of pictures: the input images. Middle
column: the result of applying the appropriate matrix to the images; these results are identical
to the first model pictures (which serve as canonical views). Right column: the result of applying
the wrong matrix to the images; these results are not similar to the canonical views.

It is worth mentioning that if we match a sufficient number of corresponding subsets
of points, the exact point to point correspondence can also be resolved, and the two 
methods are equivalent. However, the number of subsets may be smaller than the number 
of points, or we can take subsets of points that are corresponding in most of the points, 
but not in all of them, and still obtain good results (as shown below). 

To use the above idea it becomes necessary to establish a correspondence between 
subsets of the patterns instead of the individual points. There are several possible ways 
to approach this problem. Here we propose a simple method, motivated in part by 
considerations of biological plausibility, that is based on the notion of linear receptive 
fields. 

A linear receptive field (LRF) is an operator that takes a weighted contribution of the 
points falling within a given region, using a linear weighting function. We will assume 
here that the LRF response is simply the average contribution of the points falling inside 
its region. That is, given an image $P$, the response $r$ is given by the average of $\alpha x + \beta y$ (for some
parameters $\alpha$, $\beta$), where the average is taken over all the points of $P$ falling within the
receptive field.

Let us examine the response of an LRF of this type to the model and the viewed
object. Let $P_1$ and $P_2$ be two pictures in the model set, $P$ is the viewed object, and
assume that $P = aP_1 + bP_2$. Let $r_1$, $r_2$ and $r$ be the responses of the LRF to $P_1$, $P_2$ and
$P$ respectively. For each pattern, the LRF "sees" only a subset of the points comprising
the pattern. The other points fall outside the receptive field. If the points seen by the
LRF in $P_1$, $P_2$ and $P$ are corresponding points (even if the pointwise correspondence is
unknown), then it is clear from the considerations above that $r = ar_1 + br_2$. In practice,
some of the points may not have counterparts inside the LRF, but the relation will
hold approximately provided that the majority of points remain within the limits of the
receptive field in $P_1$, $P_2$ and $P$. To obtain this condition it is desirable to: (1) use large
receptive fields, and (2) apply some rough alignment, as suggested in section 2.2 above,
prior to the match.
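
The relation between the three responses can be checked on a toy example. The sketch below assumes a rectangular receptive field and the averaging response described above; the field shape, the parameter values and the point sets are hypothetical choices made only for illustration.

```python
def lrf_response(points, field, alpha, beta):
    """Average of alpha*x + beta*y over the points inside a rectangular field.

    field = (x_min, x_max, y_min, y_max); returns 0.0 if no point falls inside.
    """
    x_min, x_max, y_min, y_max = field
    inside = [(x, y) for x, y in points
              if x_min <= x <= x_max and y_min <= y <= y_max]
    if not inside:
        return 0.0
    return sum(alpha * x + beta * y for x, y in inside) / len(inside)


# Two hypothetical model pictures; the third point lies outside the field in both.
p1 = [(1.0, 1.0), (2.0, 3.0), (8.0, 8.0)]
p2 = [(2.0, 1.0), (4.0, 2.0), (9.0, 7.0)]
a, b = 0.4, 0.6
viewed = [(a * x1 + b * x2, a * y1 + b * y2) for (x1, y1), (x2, y2) in zip(p1, p2)]

field = (0.0, 5.0, 0.0, 5.0)
r1 = lrf_response(p1, field, alpha=1.0, beta=2.0)
r2 = lrf_response(p2, field, alpha=1.0, beta=2.0)
r = lrf_response(viewed, field, alpha=1.0, beta=2.0)
```

Here the same two points fall inside the field in all three patterns, so the relation holds exactly. If a point crossed the field boundary in only one of the patterns, the relation would hold only approximately, which is the case the text addresses with large fields and rough alignment.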

We can now proceed along the following line. Let $r = (r_1, r_2, \ldots, r_m)$ be an ordered
set of LRFs. We define a model to be the result of applying this set $r$ to each of the
model pictures. Given an image $I$, we first perform a process of rough alignment as
described earlier, and denote the result by $I'$. We apply the set $r$ to $I'$, and then we
check whether the result is a linear combination of the model pictures, that is, we look
for a set $\{a_1, a_2, \ldots, a_k\}$ of coefficients such that for every $1 \le i \le m$ it holds that:

\[ r_i(I') = \sum_{j=1}^{k} a_j r_i(P_j) \qquad (4) \]

Practically, since a strict equality can rarely be achieved, we look for a set $\{a_1, a_2, \ldots, a_k\}$
of coefficients that minimize the difference between the two terms:

\[ \min_{\{a_1, \ldots, a_k\}} \Bigl\| r(I') - \sum_{j=1}^{k} a_j r(P_j) \Bigr\| \qquad (5) \]

This problem can be approached, as with the pointwise correspondence, by either 
computing a pseudo-inverse, or by performing the appropriate linear mapping. 

A preliminary stage of rough alignment is required in this scheme to bring each 
point in the image to lie close to a corresponding position in the model (one of the 
model pictures). Consequently, each linear receptive field will contain a relatively large 
proportion of corresponding points. As a result, the application of the set of LRFs to the 
image will yield approximately a linear combination of the results of applying the same 
set of LRFs to the model pictures. The justification for this approximation is given in 
Appendix C. We show there that as the proportion of corresponding points within each 
LRF increases, the result obtained by the application of this set of LRFs to the image 
gets closer to a linear combination of the results obtained by applying these LRFs to the 
model pictures. 

The use of linear receptive fields serves in this scheme two distinct purposes. The first
is to establish correspondence between subsets of image points, rather than individual
points. The second is a conversion between two different types of representations. The
linear mapping method assumes that the position of points is given by the numerical
values of their x- and y-coordinates. The input image is given, however, in a different
representation: a 2-D array of points. The LRF serves to translate the position of a
point within the receptive field to a value representing the coordinate of the point. Other
conversion schemes are possible, but the LRF is a simple one that also appears to be
biologically plausible. It is interesting to note that cells with linear receptive fields have
been described in area 7a of macaque monkeys [Zipser & Andersen 1988]. In Zipser &
Andersen's model these cells also serve the role of converting position in the plane to a
firing rate that represents the x- or y-coordinate.



3 General Discussion 

We have proposed above a method for recognizing 3-D objects from 2-D images. In 
this method, an object-model is represented by the linear combinations of several 2-D 
views of the object. It was shown that for objects with sharp edges as well as with 
smooth bounding contours the set of possible images of a given object is embedded 
in a linear space spanned by a small number of views. For objects with sharp edges 
the linear combination representation is exact. For objects with smooth boundaries 
it is an approximation that often holds over a wide range of viewing angles. Rigid
transformations (with or without scaling) can be distinguished from more general linear 
transformations of the object by testing certain constraints placed upon the coefficients 
of the linear combinations. 

We have proposed three alternative methods for determining the transformation that 
matches a model to a given image. The first method uses a small set of corresponding 
features identified in both the model and the image. Alternatively, the coefficients can be 
determined using a search. The third method uses a linear mapping as the main step in 
a scheme that maps the different views of the same object into a common representation. 

To avoid the need for pointwise correspondence, we suggested the possible use of
linear receptive fields to establish approximate correspondence between subsets of points. 

The development of the scheme so far has been primarily theoretical, and initial 
testing on a small number of objects shows good results. Future work should include 
more extensive testing using natural objects, as well as the advancement of the theoretical 
issues discussed below. 

In the concluding section we discuss three issues. First, we place the current scheme 
within the framework of alignment methods in general. Second, we discuss possible 
extensions. Finally, we list a number of general conclusions that emerge from this study. 

3.1 Classes of Alignment Schemes

The schemes discussed in this paper fall into the general class of alignment recognition
methods. Other alignment schemes have been proposed by Bajcsy & Solina [1987], Chien
& Aggarwal [1987], Faugeras & Hebert [1986], Fischler & Bolles [1981], Grimson &
Lozano-Perez [1984], Lowe [1985], and Thompson & Mundy [1987]. In an alignment scheme
we seek a transformation $T_\alpha$ out of a set of allowed transformations, and a model $M$
from a given set of models, that minimize a distance measure $d(M, T_\alpha, P)$ (where $P$ is
the image of the object). $T_\alpha$ is called the alignment transformation; it is supposed to
bring the model $M$ and the viewed object $P$ into an optimal agreement.

The distance measure $d$ typically contains two contributions:

\[ d(M, T_\alpha, P) = d_1(T_\alpha M, P) + d_2(T_\alpha) \]

The first term $d_1(T_\alpha M, P)$ measures the residual distance between the picture $P$
and the transformed model $T_\alpha M$ following the alignment, and $d_2(T_\alpha)$ penalizes for the
transformation $T_\alpha$ that was required to bring $M$ into a close agreement with $P$. For
example, it may be possible to bring $M$ into a close agreement with $P$ by stretching it
considerably. In this case $d_1(T_\alpha M, P)$ will be small, but, if large stretches of the object
are unlikely, $d_2(T_\alpha)$ will be large. We will see below that different classes of alignment
schemes differ in the relative emphasis they place on $d_1$ and $d_2$.

Alignment approaches can be subdivided according to the method used for
determining the aligning transformation $T_\alpha$. The main approaches used in the past can be
summarized by the following three categories.

Minimal alignment. In this approach $T_\alpha$ is determined by a small number of
corresponding features in the model and the image. Methods using this approach assume
that the set of possible transformations is restricted (usually to rigid 3-D transformations
with possible scaling, or a Lie transformation group [Brockett 1989]), so that the correct
transformation can be recovered using a small number of constraints.

This approach has been used by Faugeras & Hebert [1986], Fischler & Bolles [1981],
Huttenlocher & Ullman [1987], Shoham & Ullman [1988], Thompson & Mundy [1987], and
Ullman [1986, 1989]. In these schemes the term $d_2$ above is usually ignored, since there is
no reason to penalize for a rigid 3-D aligning transformation, and the match is therefore
evaluated by $d_1$ only.

The correspondence between features may be guided in these schemes by the labeling
of different types of features, such as cusps, inflections, blob-centers, etc. [Huttenlocher &
Ullman 1987, Ullman 1989], by using pairwise constraints between features [Grimson &
Lozano-Perez 1984], or by a more exhaustive search (as in [Lamdan, Schwartz, & Wolfson
1987], where possible transformations are pre-computed and hashed).

Minimal alignment can be used in the context of the linear combination scheme 
discussed in this paper. This method was discussed in Section 2.1. A small number of 
corresponding features is used to determine the coefficients of the linear combination. 
The linear combination is then computed, and the result compared with the viewed 
image. 

Full alignment. In this approach a full correspondence is established between the
model and the image. This correspondence defines a distortion transformation that
takes $M$ into $P$. The set of transformations is not restricted in this approach to rigid
transformations. Complex non-rigid distortions are included as well. In contrast with
minimal alignment, in the distance measure $d$ above, the first term $d_1(T_\alpha M, P)$ does
not play an important role, since the full correspondence forces $T_\alpha M$ and $P$ to be in
close agreement. The match is therefore evaluated by the plausibility of the required
transformation $T_\alpha$. Our linear mapping scheme in section 2.3 is a full alignment scheme.
A full correspondence is established to produce a vector that the linear mapping can then
act upon.

Alignment search. In contrast with the previous approaches, this method does not 
use feature correspondence to recover the transformation. Instead, a search is conducted 
in the space of possible transformations. The set of possible transformations $\{T_a\}$ is 
parametrized by a parameter vector $a$, and a search is performed in the parameter space 
to determine the best value of $a$. The deformable template method [Yuille, Cohen, & 
Hallinan, 1989] is an example of this approach. Section 2.2 described the possibility of 
performing such a search in the linear combination approach to determine the value of 
the required coefficients. 
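
A toy sketch of such a search, under simplifying assumptions, is given below. It uses only 
two coefficients applied to whole 2-D point sets, an exhaustive grid instead of a more 
refined search procedure, and a symmetric nearest-neighbour distance as the match score, so 
that no correspondence between the image and the models is needed. The functions `chamfer` 
and `alignment_search` are hypothetical names introduced for this example. 

```python
import numpy as np

def chamfer(a, b):
    """Symmetric mean nearest-neighbour distance between two 2-D point sets."""
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return dists.min(axis=1).mean() + dists.min(axis=0).mean()

def alignment_search(models, image_pts, grid):
    """Exhaustive search over coefficient pairs (a1, a2) taken from `grid`."""
    best_coeffs, best_score = None, np.inf
    for a1 in grid:
        for a2 in grid:
            # Candidate view: linear combination of the two model point sets.
            candidate = a1 * models[0] + a2 * models[1]
            score = chamfer(candidate, image_pts)
            if score < best_score:
                best_coeffs, best_score = (a1, a2), score
    return best_coeffs, best_score

rng = np.random.default_rng(3)
models = rng.uniform(-1.0, 1.0, size=(2, 25, 2))   # two model views, 25 points each
# The image is a combination of the models with its points in unknown (shuffled) order.
image_pts = rng.permutation(0.3 * models[0] + 0.7 * models[1])
grid = np.linspace(0.0, 1.0, 11)
best_coeffs, best_score = alignment_search(models, image_pts, grid)
```

The cost of the grid grows exponentially with the number of coefficients, which is why a 
guided search would be preferable in practice. 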



3.2 Extensions 

The linear combination (LC) recognition scheme is restricted in several ways. It will be 
of interest to extend it in the future in at least three directions: relaxing the constraints, 
dealing effectively with occlusions, and dealing with large libraries of objects. We limit 
the discussion below to brief comments on these three issues. 

Relaxing the constraints 

The scheme as presented assumes rigid transformations and an orthographic projection. 
Under these conditions, all the views of a given object are embedded in a low-dimensional 
linear subspace of a much larger space. What happens if the projection is perspective 
rather than orthographic, or if the transformations are not entirely rigid? The effect of 
perspectivity appears to be quite limited. We have applied the LC scheme to objects 
with ratio of distance-to-camera to object-size down to 4:1, with only minor effects on 
the results (less than 3% deviation from the orthographic projection for rotations up to 
45°). 

As for non-rigid transformations, an interesting general extension to explore is where 
the set of views is no longer a linear subspace, but still occupies a low-dimensional 
manifold within a much higher-dimensional space. This manifold resembles locally a 
linear subspace, but it is no longer "globally straight". By analogy, one can visualize the 
simple linear combinations case in terms of a 3-D space, in which all the orthographic 
views of a rigid object are restricted to some 2-D plane. In the more general case the 
plane will bend, to become a curved 2-D manifold within the 3-D space. 

This appears to be a general case of interest for recognition as well as for other learning 
tasks. For recognition to be feasible, the set of views {V} corresponding to a given object 
cannot be arbitrary, but must obey some constraints, e.g., in the form $F(V_i) = 0$. Under 
general conditions, these restrictions will define locally a manifold embedded in the larger 
space. Algorithms that can learn to classify efficiently sets that form low dimensional 
manifolds embedded in high dimensional spaces will therefore be of general value. 

Occlusion 

In the linear combination scheme we assumed that the same set of points is visible in 
the different views. What happens if some of the object's points are occluded by either 
self-occlusion or by other objects? 

As we mentioned in Section 1.3.5, self-occlusion is handled by representing an object 
not by a single model, but by a number of models covering its different "aspects" 
[Koenderink & Van Doorn 1979]. 

As for occlusion by other objects, this problem is handled in a different manner 
by the minimal alignment and the full alignment versions of the LC scheme. In the 
minimal alignment version, a small number of corresponding features are used to recover 
the coefficients of the linear combination. In this scheme, occlusion does not present a 
major special difficulty. After computing the linear combination, a good match will be 
obtained between the transformed model and the visible part of the object, and recognition 
may proceed on the basis of this match. (Alignment search will behave in a similar 
manner.) 

In the linear mapping version, an object's view is represented by a vector $v_i$ of its 
coordinates. Due to occlusion, some of the coordinates will remain unknown. A way of 
evaluating the match in this case in an optimal manner is suggested in Appendix D. 

Multiple models 

We have considered above primarily the problem of matching a viewed object with a 
single model. If there are many candidate models, a question arises regarding the scaling 
of the computational load with the number of models. 

In the LC scheme, the main problem is in the stage of performing the correspondence, 
since the subsequent testing of a candidate model is relatively straightforward. The 
linear mapping scheme is particularly attractive in this regard: once the correspondence 
is known, the testing of a model requires only a multiplication of a matrix by a vector. 

With respect to the correspondence stage, the question is how to perform correspondence 
efficiently with multiple models. This problem remains open for future study; we 
just comment here on a possible direction. The idea is to use pre-alignment to a prototype 
in the following manner. Suppose that $M_1, ..., M_k$ is a family of related models. A 
single model $M$ will be used for representing this set for the purpose of alignment. 
The correspondence $T_i$ between each $M_i$ in the set and $M$ is pre-computed. Given 
an observed object $P$, a single correspondence $T : M \to P$ is computed. The individual 
transformations $M_i \to P$ are computed by the compositions $T \circ T_i$. 
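
The composition step can be sketched as below, assuming for illustration that a 
correspondence is stored as an integer index array: entry j of the map from a model to the 
prototype holds the index of the prototype point matched to point j of that model, and 
the map from the prototype to the image is stored in the same way. This representation 
and the name `compose` are assumptions of the example, not part of the scheme. 

```python
import numpy as np

def compose(proto_to_image, model_to_proto):
    """Return the map model -> image obtained by going through the prototype."""
    return proto_to_image[model_to_proto]

# Pre-computed once per family: correspondence of each model with the prototype.
model_to_proto = {
    "M1": np.array([2, 0, 1, 4, 3]),
    "M2": np.array([1, 3, 0, 2, 4]),
}

# Computed once per observed object: correspondence of the prototype with the image.
proto_to_image = np.array([4, 2, 0, 1, 3])

# One array lookup per model replaces a separate correspondence computation.
model_to_image = {name: compose(proto_to_image, t) for name, t in model_to_proto.items()}
```

In this form the costly correspondence computation is carried out once per family, and 
each additional model in the family costs only an index lookup. 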



3.3 General conclusions 

In this section we summarize briefly a number of general characteristics of the linear 
combinations scheme. In this scheme, as in some other alignment schemes, significant 
aspects of visual object recognition are more low-level in nature and more pictorial 
compared with structural description recognition approaches [e.g., Biederman 1985]. The 
scheme uses directly 2-D views rather than an explicit 3-D model. The use of the 2-D 
views is different, however, from a simple associative memory [Abu-Mostafa & Psaltis 
1987] where new views are simply compared in parallel to all previously stored views. 
Rather than measuring the distance between the observed object and each of the stored 
views, a distance is measured from the observed object to the linear subspace (or a 
low-dimensional manifold) defined by previous views. 
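
The difference between the two distance measures can be illustrated with a small numerical 
sketch. It assumes that the stored views are coordinate vectors placed as columns of a 
matrix, and it computes the distance to their span by least squares; the function names 
are introduced for this example only. 

```python
import numpy as np

def distance_to_nearest_view(views, p):
    """Associative-memory style measure: distance to the closest stored view."""
    return min(np.linalg.norm(p - v) for v in views.T)

def distance_to_subspace(views, p):
    """Distance from p to the linear subspace spanned by the stored views."""
    coeffs, *_ = np.linalg.lstsq(views, p, rcond=None)
    return np.linalg.norm(p - views @ coeffs)

rng = np.random.default_rng(1)
views = rng.normal(size=(40, 3))               # three stored views (columns)
novel = views @ np.array([0.6, 0.7, -0.4])     # a new view in their span

d_view = distance_to_nearest_view(views, novel)
d_span = distance_to_subspace(views, novel)
```

Here `d_span` is zero up to rounding error, while `d_view` is not, so the novel view would 
be accepted by the subspace measure although it is far from every individual stored view. 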

The linear combination scheme "reduces" the recognition problem in a sense to the 
problem of establishing a correspondence between the viewed object and candidate 
models. The method demonstrates that if a correspondence can be established, the remaining 
computation is relatively straightforward. Establishing a reliable correspondence between 
images is not an easy task, but it is a general task solved by the visual system (e.g. in 
motion measurement and stereoscopic vision), and related processes may also be involved 
in visual object recognition. 

Acknowledgement: We wish to thank E. Grimson, S. Edelman, T. Poggio and 
A. Yuille for helpful comments, T. Poggio also for his suggestions regarding the use of 
two views, and A. Yuille for Appendix B. 



Appendix A 

In section 1.4.2 we showed that the images of an object with smooth surfaces rotating 
in 3-D space can be represented as the linear combination of five views, and mentioned 
that the coefficients for these linear combinations satisfy seven functional constraints. In 
this appendix we list these constraints. 

We use the same notation as in section 1.4.2. Let $R_1, ..., R_5, R$ be 3 x 3 rotation 
matrices, and $R'_1, ..., R'_5, R'$ be the corresponding 2 x 5 matrices defined in section 1.4.2. Let 
$r_1, ..., r_5, r$ be the first row vectors, and $s_1, ..., s_5, s$ the second row vectors of 
$R'_1, ..., R'_5, R'$, respectively. In section 1.4.2 we showed that each of the two row vectors 
of $R'$ is a linear combination of the corresponding row vectors of $R'_1, ..., R'_5$. That is, 

$$ r = \sum_{i=1}^{5} a_i r_i $$

$$ s = \sum_{i=1}^{5} b_i s_i $$

The functional constraints can be expressed as: 

$$ r_1^2 + r_2^2 + r_3^2 = 1 $$

$$ s_1^2 + s_2^2 + s_3^2 = 1 $$

$$ r_1 s_1 + r_2 s_2 + r_3 s_3 = 0 $$

$$ r_1 + r_4 = s_2 + s_5 $$

$$ r_2 + r_5 = -(s_1 + s_4) $$

$$ (r_1 + r_4)^2 + (s_1 + s_4)^2 = 1 $$

$$ r_4 s_5 = s_4 r_5 $$

(Constraints 1, 2, 3 and 7 are immediate. Constraints 4, 5, 6 can be verified by expressing 
all the entries in terms of the rotation angles $\alpha, \beta, \gamma$.) 

To express these constraints as a function of the coefficients, every occurrence of a 
term $r_j$ or $s_j$ should be replaced by the appropriate linear combination, as follows: 

$$ r_j = \sum_{i=1}^{5} a_i (r_i)_j $$

$$ s_j = \sum_{i=1}^{5} b_i (s_i)_j $$



In the case of a similarity transformation (i.e., with scale change) the first two 
constraints are substituted by: 

$$ r_1^2 + r_2^2 + r_3^2 = s_1^2 + s_2^2 + s_3^2 $$



Appendix B 

In this appendix we describe a method to find a space of a given dimension, that lies as 
close as possible to a given set of points. 

Let $\{p_1, p_2, ..., p_m\}$ be a set of points in $\mathbb{R}^n$. We would like to find the 
$(n - k)$-dimensional space that lies as close as possible (in the least-squares sense) to the 
points $\{p_1, p_2, ..., p_m\}$. Let $P$ be the $n \times m$ matrix given by $(p_1, p_2, ..., p_m)$. 
Let $\{u_1, ..., u_n\}$ be a set of orthonormal vectors in $\mathbb{R}^n$, and define 
$U_k = \mathrm{span}\{u_{k+1}, ..., u_n\}$. The sum of the distances (squared) of the points 
$p_1, p_2, ..., p_m$ from $U_k$ is given by: 

$$ D^2(U_k) = \sum_{i=1}^{k} \| P^t u_i \|^2 $$

(Since $\sum_{i=1}^{k} (p_j^t u_i)^2$ is the squared distance of $p_j$ from $U_k$.) 
Let $F = PP^t$. Then: 

$$ D^2(U_k) = \sum_{i=1}^{k} \| P^t u_i \|^2 = \sum_{i=1}^{k} u_i^t P P^t u_i = \sum_{i=1}^{k} u_i^t F u_i $$

Any real matrix of the form $XX^t$ is symmetric and non-negative definite. Therefore, $F$ has 
$n$ eigenvectors and $n$ real non-negative eigenvalues. Assume that the $\{u_1, ..., u_n\}$ above 
are the eigenvectors of $F$ with eigenvalues $\lambda_1 \le \lambda_2 \le ... \le \lambda_n$ 
respectively, then $F u_i = \lambda_i u_i$, and therefore: 

$$ D^2(U_k) = \sum_{i=1}^{k} \lambda_i $$

Claim: Let $\{\lambda_1, ..., \lambda_k\}$ be the $k$ smallest eigenvalues of $F$, then: 

$$ \sum_{i=1}^{k} \lambda_i = \min_{V_k} D^2(V_k) $$

where the minimum is taken over all the linear subspaces $V_k$ of dimension $n - k$. Therefore, 
$\mathrm{span}\{u_{k+1}, ..., u_n\}$ is the best $(n - k)$-dimensional space through $p_1, p_2, ..., p_m$. 

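Proof: Let $V_k$ be a linear subspace of dimension $(n - k)$. We must establish that: 

$$ D^2(V_k) \ge D^2(U_k) $$

Let $\{v_1, ..., v_n\}$ be a set of orthonormal vectors in $\mathbb{R}^n$ such that 
$V_k = \mathrm{span}\{v_{k+1}, ..., v_n\}$. $V = (v_1, ..., v_n)$ and $U = (u_1, ..., u_n)$ are 
$n \times n$ orthonormal matrices. Let: 

$$ R = U^t V $$

Then: 

$$ UR = V $$

That is: 

$$ v_j = \sum_{i=1}^{n} r_{ij} u_i $$

$R$ is also orthonormal, therefore: 

$$ \sum_{i=1}^{n} r_{ij}^2 = \sum_{j=1}^{n} r_{ij}^2 = 1 $$

Now: 

$$ F v_j = \sum_{i=1}^{n} r_{ij} F u_i = \sum_{i=1}^{n} r_{ij} \lambda_i u_i $$

And therefore: 

$$ v_j^t F v_j = \left( \sum_{i=1}^{n} r_{ij} u_i^t \right) \left( \sum_{l=1}^{n} r_{lj} \lambda_l u_l \right) $$

Since $u_i^t u_j = \delta_{ij}$ we obtain that: 

$$ v_j^t F v_j = \sum_{i=1}^{n} r_{ij}^2 \lambda_i $$

Therefore: 

$$ D^2(V_k) = \sum_{j=1}^{k} v_j^t F v_j = \sum_{j=1}^{k} \sum_{i=1}^{n} r_{ij}^2 \lambda_i = \sum_{i=1}^{n} \left( \sum_{j=1}^{k} r_{ij}^2 \right) \lambda_i $$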

Let: 

$$ \alpha_i = \sum_{j=1}^{k} r_{ij}^2 $$

Then: 

$$ D^2(V_k) = \sum_{i=1}^{n} \alpha_i \lambda_i $$

where $0 \le \alpha_i \le 1$ and $\sum_{i=1}^{n} \alpha_i = k$. 

The claim we wish to establish is that the minimum is obtained when $\alpha_i = 1$ for 
$i = 1...k$, and $\alpha_i = 0$ for $i = k+1...n$. Assume that for $V_k$ there exists 
$1 \le m \le k$ such that $\alpha_m < 1$, and $k + 1 \le l \le n$ such that $\alpha_l > 0$. We can 
decrease $\alpha_l$ and increase $\alpha_m$ (by $\min(\alpha_l, 1 - \alpha_m)$), and since 
$\lambda_m \le \lambda_l$ this cannot increase the value of $D^2(V_k)$. By repeating this 
process we will eventually reach the value of $D^2(U_k)$. Since during this process the value 
cannot increase, we obtain that: 

$$ D^2(U_k) \le D^2(V_k) $$

And therefore: 

$$ \sum_{i=1}^{k} \lambda_i = \min_{V_k} D^2(V_k) $$
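
The result can be checked numerically with a short sketch. It follows the construction 
above (eigenvectors of $F = PP^t$, discarding the $k$ smallest eigenvalues) and, like the 
appendix, considers linear subspaces through the origin. The function names and the random 
test data are assumptions made for the example. 

```python
import numpy as np

def best_subspace(points, k):
    """points: n x m matrix P whose columns are p_1..p_m.

    Returns an orthonormal basis of U_k (n x (n-k)) and the value D^2(U_k).
    """
    F = points @ points.T
    eigvals, eigvecs = np.linalg.eigh(F)   # eigenvalues in ascending order
    basis = eigvecs[:, k:]                 # u_{k+1}, ..., u_n
    return basis, eigvals[:k].sum()        # sum of the k smallest eigenvalues

def squared_distance_sum(points, basis):
    """Sum of squared distances of the columns of `points` from span(basis)."""
    residual = points - basis @ (basis.T @ points)
    return np.sum(residual ** 2)

rng = np.random.default_rng(4)
P = rng.normal(size=(5, 50))               # 50 points in R^5
k = 2
U_k, d2_min = best_subspace(P, k)
d2_direct = squared_distance_sum(P, U_k)

# Any other (n-k)-dimensional subspace should not be closer to the points.
V_k, _ = np.linalg.qr(rng.normal(size=(5, 5 - k)))
d2_other = squared_distance_sum(P, V_k)
```

The value `d2_direct`, computed from the residuals, agrees with the sum of the two smallest 
eigenvalues, and `d2_other` is at least as large, as the claim requires. 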



Appendix C 

In this appendix we establish that in the method using linear receptive fields the 
approximation improves with the proportion of corresponding points within each receptive 
field, and derive a bound on the error. We are given a set of points in the image 
$p = (p_1, ..., p_{n_0})$ that fall within a given receptive field, and $k$ sets of model points 
$P_1 = (p_{11}, ..., p_{1 n_1}), ..., P_k = (p_{k1}, ..., p_{k n_k})$ that fall within the same 
receptive field. Let $\bar{p}$ be the average of $p_1, ..., p_{n_0}$, and $\bar{p}_i$ the average of 
$p_{i1}, ..., p_{i n_i}$ for every $1 \le i \le k$. We next show that the difference between 
$\bar{p}$ and the linear combination of $\bar{p}_1, ..., \bar{p}_k$ is bounded by a term which is 
proportional to the relative number of points within the receptive field that lack a 
correspondence. 

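Claim: For some given constants $a_1, ..., a_k$, let $l$ be the largest index such that for 
every $1 \le j \le l$ it holds that $\sum_{i=1}^{k} a_i p_{ij} = p_j$. Denote 
$n = \max\{n_1, ..., n_k, n_0\}$ (where $n_0$ is the number of image points), 
$d = \max_{i,j,k} \{|p_{ij} - p_{ik}|, |p_j - p_k|\}$ and $q = 1 - l/n$, then: 

$$ \left| \bar{p} - \sum_{i=1}^{k} a_i \bar{p}_i \right| \le q d \left( 1 + \sum_{i=1}^{k} |a_i| \right) $$

(where $d$ is the diameter of the receptive field). 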

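Proof: Let us first extend the sets of points in such a manner that each will have 
the same number of points, $n$. We will do so by setting $p_{ij} = \bar{p}_i$ for every 
$1 \le i \le k$, $n_i < j \le n$, and let $p_j = \bar{p}$ for every $n_0 < j \le n$ (where $n_0$ 
is the original number of image points). We now have a new set of vectors 
$P_1, ..., P_k, p$, each of length $n$, all having the same averages they had originally. Therefore: 

$$
\begin{aligned}
\left| \bar{p} - \sum_{i=1}^{k} a_i \bar{p}_i \right|
 &= \left| \frac{1}{n} \sum_{j=1}^{n} p_j - \sum_{i=1}^{k} a_i \frac{1}{n} \sum_{j=1}^{n} p_{ij} \right|
  = \frac{1}{n} \left| \sum_{j=1}^{n} \left( p_j - \sum_{i=1}^{k} a_i p_{ij} \right) \right| \\
 &\le \frac{1}{n} \sum_{j=1}^{n} \left| p_j - \sum_{i=1}^{k} a_i p_{ij} \right|
  = \frac{1}{n} \sum_{j=l+1}^{n} \left| p_j - \sum_{i=1}^{k} a_i p_{ij} \right|
\end{aligned}
$$

(the first $l$ terms of the sum vanish by the definition of $l$). 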



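Now, let $d_{ij} = p_{ij} - p_{i1}$ and $d_j = p_j - p_1$. Using $p_1 = \sum_{i=1}^{k} a_i p_{i1}$, 
we obtain: 

$$
\begin{aligned}
\left| \bar{p} - \sum_{i=1}^{k} a_i \bar{p}_i \right|
 &\le \frac{1}{n} \sum_{j=l+1}^{n} \left| p_j - \sum_{i=1}^{k} a_i p_{ij} \right|
  = \frac{1}{n} \sum_{j=l+1}^{n} \left| p_1 + d_j - \sum_{i=1}^{k} a_i (p_{i1} + d_{ij}) \right| \\
 &= \frac{1}{n} \sum_{j=l+1}^{n} \left| d_j - \sum_{i=1}^{k} a_i d_{ij} \right|
  \le \frac{1}{n} \sum_{j=l+1}^{n} \left( |d_j| + \sum_{i=1}^{k} |a_i| |d_{ij}| \right)
  \le q d \left( 1 + \sum_{i=1}^{k} |a_i| \right)
\end{aligned}
$$

Therefore, the difference $|\bar{p} - \sum_{i=1}^{k} a_i \bar{p}_i|$ is bounded by a term which is 
proportional to $q$. 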

From this claim we can conclude the following: Let $\bar{p}_1, ..., \bar{p}_k$ be the values 
obtained by applying a linear receptive field to the pictures of a given model, and let 
$\bar{p}$ be the value obtained by applying the same LRF to a given image. If the image can 
be represented as a linear combination of the model pictures, then the error 
$|\bar{p} - \sum_{i=1}^{k} a_i \bar{p}_i|$ is bounded by a term which is proportional to $q$. 
Therefore we can in principle reduce this term by reducing $q$, that is, by constructing the 
LRF such that it will cover more corresponding points of each picture. 
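
The bound can be checked numerically on synthetic data. The sketch below assumes, for 
simplicity, that all point sets already contain the same number $n$ of 2-D points (so the 
padding step of the proof is not needed) and that the first `n_corr` image points are exact 
linear combinations of the corresponding model points. The function name `lrf_bound_check` 
and the random data are introduced for the example only. 

```python
import numpy as np

def lrf_bound_check(model_pts, coeffs, image_pts, n_corr):
    """Return (error, bound) for the averaged points in one receptive field.

    model_pts : (k, n, 2) model points, image_pts : (n, 2) image points,
    n_corr    : number l of leading image points that are in correspondence.
    """
    n = image_pts.shape[0]
    p_bar = image_pts.mean(axis=0)
    model_bars = model_pts.mean(axis=1)                    # shape (k, 2)
    error = np.linalg.norm(p_bar - coeffs @ model_bars)
    # d: largest distance between two points of the same set.
    sets = list(model_pts) + [image_pts]
    d = max(np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1).max() for s in sets)
    q = 1.0 - n_corr / n
    bound = q * d * (1.0 + np.abs(coeffs).sum())
    return error, bound

rng = np.random.default_rng(2)
k, n, l = 3, 20, 15
model_pts = rng.uniform(0.0, 1.0, size=(k, n, 2))
coeffs = np.array([0.5, 0.3, 0.2])
image_pts = np.einsum("i,ijc->jc", coeffs, model_pts)      # all points correspond
error_full, bound_full = lrf_bound_check(model_pts, coeffs, image_pts, n)

image_pts[l:] = rng.uniform(0.0, 1.0, size=(n - l, 2))     # last n-l points do not
error, bound = lrf_bound_check(model_pts, coeffs, image_pts, l)
```

With full correspondence both the error and the bound vanish (up to rounding); with a 
quarter of the points unmatched the error stays below the bound computed from $q = 0.25$. 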



Appendix D 

In the linear mapping method a matrix L was constructed that maps every legal view v 
of the object to a constant output vector. If the common output is chosen to be the zero 
vector, then $Lv = 0$ for any legal view of the object. 

In this appendix we consider briefly the case where the object is only partially visible. 
We model this situation by assuming that we are given a partial vector p. In this 
vector the first k coordinates are unknown, due to the occlusion, and only the last n — k 
coordinates are observable. (A partial correspondence between the occluded object and 
the model is assumed to be known.) 

In the vector $p$ we take the first $k$ coordinates to be zero. We try to construct from 
$p$ a new vector $p'$ by supplementing the missing coordinates so as to minimize $\| Lp' \|$. 
The relation between $p$ and $p'$ is: 

$$ p' = p + \sum_{i=1}^{k} a_i u_i $$

where the $a_i$ are unknown constants, and the $u_i$ are unit vectors along the first $k$ 
coordinates. 

In matrix notation, we seek to complement the occluded view by minimizing: 

$$ \min_{a} \| Lp + LUa \| $$

where the columns of the matrix $U$ are the vectors $u_i$ and $a$ is the vector of the 
unknown $a_i$'s. 

The solution to this minimization problem is: 

$$ a = -[LU]^{+} Lp $$

(where $H^{+}$ denotes the pseudo-inverse of the matrix $H$). This means that the 
pseudo-inverse $[LU]^{+}$ will have to be computed. The matrix $L$ is fixed, but $U$ depends on 
the points that are actually visible. 

This optimal value of $a$ can also be used to determine the output vector of the 
recognition process, $Lp'$: 

$$ Lp' = (I - [LU][LU]^{+}) Lp $$

$p$ is then recognized as a legal view if this output is sufficiently close to zero. 
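
A small numerical sketch of this procedure is given below. Since the construction of $L$ 
is described elsewhere in the paper, the sketch substitutes a stand-in: it assumes that 
the legal views span the columns of a matrix `B` and takes $L$ to be the projection onto 
the orthogonal complement of that span, so that $Lv = 0$ for every legal view. The names 
`complete_occluded_view` and `B`, and the test data, are assumptions of the example. 

```python
import numpy as np

def complete_occluded_view(L, p, k):
    """Fill in the first k (occluded) coordinates of p so as to minimize ||L p'||."""
    n = p.shape[0]
    U = np.eye(n)[:, :k]                 # unit vectors along the occluded coordinates
    LU = L @ U
    a = -np.linalg.pinv(LU) @ (L @ p)    # a = -[LU]^+ L p
    p_completed = p + U @ a
    output = L @ p_completed             # equals (I - [LU][LU]^+) L p
    return p_completed, output

rng = np.random.default_rng(5)
n, m, k = 12, 3, 4
B = rng.normal(size=(n, m))              # columns span the legal views (stand-in)
L = np.eye(n) - B @ np.linalg.pinv(B)    # L v = 0 for every v in span(B)

v = B @ np.array([1.0, -2.0, 0.5])       # a legal view
p = v.copy()
p[:k] = 0.0                              # first k coordinates occluded
completed, output = complete_occluded_view(L, p, k)

w = rng.normal(size=n)                   # a vector that is not a legal view
w[:k] = 0.0
_, output_other = complete_occluded_view(L, w, k)
```

In this synthetic case the occluded coordinates of the legal view are recovered and the 
output is zero up to rounding, whereas the unrelated vector leaves a clearly non-zero output. 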